Abstract: The performance of the Lasso is well understood under the assumptions
of the standard linear model with homoscedastic noise. However, in several
applications, the standard model does not describe the important features of the
data. This paper examines how the Lasso performs on a non-standard model that is
motivated by medical imaging applications. In these applications, the variance
of the noise scales linearly with the expectation of the observation. Like all
heteroscedastic models, the noise terms in this Poisson-like model are not
independent of the design matrix.

More specifically, this paper studies the sign consistency of the Lasso under a 
sparse Poisson-like model. In addition to studying sufficient conditions for the sign 
consistency of the Lasso estimate, this paper also gives necessary conditions for sign 
consistency. Both sets of conditions are comparable to results for the homoscedastic 
model, showing that when a measure of the signal to noise ratio is large, the Lasso 
performs well on both Poisson-like data and homoscedastic data. 

Simulations reveal that the Lasso performs equally well in terms of model selec- 
tion performance on both Poisson-like data and homoscedastic data (with properly 
scaled noise variance), across a range of parameterizations. Taken as a whole, these 
results suggest that the Lasso is robust to the Poisson-like heteroscedastic noise. 

Key words and phrases: Lasso, Poisson-like Model, Sign Consistency,
Heteroscedasticity

1 Introduction 

The Lasso (Tibshirani, 1996) is widely used in high dimensional regression for
variable selection. Its model selection performance has been well studied under
a standard sparse and homoscedastic regression model. Several researchers have
shown that under sparsity and regularity conditions, the Lasso can select the
true model asymptotically even when $p \gg n$ (Donoho et al., 2006; Meinshausen
and Bühlmann, 2006; Tropp, 2006; Wainwright, 2009; Zhao and Yu, 2006).

To define the Lasso estimate, suppose the observed data are independent
pairs $\{(x_i, Y_i)\} \in \mathbb{R}^p \times \mathbb{R}$ for $i = 1, 2, \ldots, n$ following the linear
regression model

$$Y_i = x_i^T \beta^* + \epsilon_i, \qquad (1)$$

where $x_i^T$ is a row vector representing the predictors for the $i$th observation, $Y_i$
is the corresponding $i$th response variable, the $\epsilon_i$ are independent, mean-zero
noise terms, and $\beta^* \in \mathbb{R}^p$. Use $X \in \mathbb{R}^{n \times p}$ to denote the $n \times p$ design matrix with
$x_k^T = (X_{k1}, \ldots, X_{kp})$ as its $k$th row and with $X_j = (X_{1j}, \ldots, X_{nj})^T$ as its $j$th
column, so that

$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix} = (X_1, X_2, \ldots, X_p).$$

Let $Y = (Y_1, \ldots, Y_n)^T$ and $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T \in \mathbb{R}^n$. The Lasso estimate
(Tibshirani, 1996) is then defined as the solution to a penalized least squares
problem (with regularization parameter $\lambda$):

$$\hat\beta(\lambda) = \arg\min_{\beta} \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1, \qquad (2)$$

where for some vector $x \in \mathbb{R}^k$, $\|x\|_r = \left(\sum_{i=1}^k |x_i|^r\right)^{1/r}$.
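As a rough illustration of objective (2), the sketch below minimizes it by cyclic coordinate descent with soft-thresholding. This is a generic solver chosen for brevity, not the LARS algorithm used for the simulations in Section 4; the names `soft_threshold` and `lasso_cd` and the fixed number of sweeps are assumptions made for this example.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, Y, lam, n_sweeps=200):
    """Minimize (1/(2n)) * ||Y - X b||_2^2 + lam * ||b||_1 by cyclic
    coordinate descent. X is a list of n rows, each a list of p floats.
    (Hypothetical helper; n_sweeps is an arbitrary illustrative default.)"""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(Y)  # residual Y - X beta, kept up to date
    # (1/n) * ||X_j||_2^2 for each column j
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # (1/n) * X_j^T (partial residual that excludes coordinate j)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Tiny orthogonal design: each coordinate can be solved in closed form.
beta_hat = lasso_cd([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=0.5)
```

In this toy design the second coordinate is thresholded exactly to zero, which is the mechanism behind the Lasso's variable selection.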

In previous research on the Lasso, the above model has been assumed where
the noise terms are i.i.d. and independent of the predictors (hence
homoscedastic). We call this the standard model.

Candes and Tao (2007) suggested that compressed sensing, a sparse method
similar to the Lasso, could reduce the number of measurements needed by medical
imaging technologies such as Magnetic Resonance Imaging (MRI). This methodology
was later applied to MRI by Lustig et al. (2008). The standard model was useful
to their analyses. However, the standard model is not appropriate for other
imaging methods such as PET and SPECT (Fessler, 2000).

PET provides an indirect measure for the metabolic activity of a specific 
tissue. To take an image, a biochemical metabolite must be identified that is 
attractive to the tissue under investigation. This biochemical metabolite is
labeled with a positron-emitting radioactive material and is then injected into
the subject. This substance circulates through the subject, emitting positrons.
When the tissue gathers the metabolite, the radioactive material concentrates 
around the tissue. 

The positron emissions can be modeled by a Poisson point process in three 
dimensions with an intensity rate proportional to the varying concentrations of 
the biochemical metabolite. Therefore, an estimate of the intensity rate is an
estimate of the level of biochemical metabolite. However, the positron emissions
are not directly observed. After each positron is emitted, it very quickly
annihilates a nearby electron, sending two X-ray photons in nearly opposite
directions at the speed of light (Vardi et al., 1985). These X-rays are observed
by several sensors in a ring surrounding the subject.

A physical model of this system informs the estimation of the intensity level
of the Poisson process from the observed data. It can be expressed as a Poisson
model where the sample size $n$ represents the number of sensors; $Y$ is a vector
of observed values; $\beta^*$ represents the Poisson intensity rate for a small cubic
volume (a voxel) inside the subject; the design matrix $X$ specifies the physics of
the tomography and emissions process; and $p$ is the number of voxels wanted.
The more voxels, the finer the resolution of the final image.

Because the positron emissions are modeled by a Poisson point process, the
variance of each observed value $Y_i$ is equal to the expected value $E(Y_i)$. Motivated
by the Poissonian model, this paper studies the Lasso under the following sparse
Poisson-like model:



$$
\begin{aligned}
Y &= X\beta^* + \epsilon, \\
E(\epsilon \mid X) &= 0, \\
\mathrm{Cov}(\epsilon \mid X) &= \sigma^2 \times \mathrm{diag}(|X\beta^*|), \\
\epsilon &\perp\!\!\!\perp X(S^c) \mid X(S),
\end{aligned}
\qquad (3)
$$

where $\sigma^2 > 0$ and the sparsity index set is defined as

$$S = \{1 \le j \le p : \beta^*_j \ne 0\},$$

with the cardinality $q = \#S$ such that $0 < q < p$.

In the definition of the Poisson-like model, $\epsilon$ conditioned on $X$ consists of
independent Gaussian variables; $\mathrm{Cov}(\epsilon \mid X)$, the variance-covariance matrix of $\epsilon$
conditioned on $X$, is $\sigma^2 \times \mathrm{diag}(|X\beta^*|)$, an $n \times n$ diagonal matrix with the vector
$\sigma^2 \times |X\beta^*|$ down the diagonal; and $X(S)$ and $X(S^c)$ denote two matrices
consisting of the relevant column vectors (with nonzero coefficients) and irrelevant
column vectors (with zero coefficients) respectively. This is a heteroscedastic
model.
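To make model (3) concrete, the following sketch draws one response vector from it for a given design. The function name `simulate_poisson_like`, the toy dimensions, and the seed are assumptions for illustration only. Because the noise variance depends on $X$ only through $x_i^T\beta^*$, it depends only on the relevant columns $X(S)$.

```python
import math
import random


def simulate_poisson_like(X, beta_star, sigma2, rng):
    """Return Y = X beta* + eps, where eps_i | X ~ N(0, sigma2 * |x_i^T beta*|).
    X is a list of rows; rng is a random.Random instance."""
    Y = []
    for row in X:
        mean_i = sum(x * b for x, b in zip(row, beta_star))
        sd_i = math.sqrt(sigma2 * abs(mean_i))  # heteroscedastic standard deviation
        Y.append(mean_i + rng.gauss(0.0, sd_i))
    return Y


rng = random.Random(0)  # seed chosen arbitrarily for reproducibility
X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(8)]  # toy 8 x 5 design
beta_star = [2.0, -1.0, 0.0, 0.0, 0.0]  # q = 2 relevant predictors
Y = simulate_poisson_like(X, beta_star, sigma2=1.0, rng=rng)
```

An observation whose mean $x_i^T\beta^*$ is zero receives no noise at all under this model, which is one way it differs from the standard model.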

Since the Lasso provides a computationally feasible way to select a model 
(Efron et al., 2004; Osborne et al., 2000; Rosset, 2004; Zhao and Yu, 2007), it 
can be applied in the non-standard settings to give sparse solutions. In this 
paper we show that the Lasso is robust to the heteroscedastic noise of the sparse 
Poisson-like model. Under the Poisson-like model, for general scalings of $p$, $q$, $n$,
and $\beta^*$, this paper investigates, through theory and simulation, when the Lasso
is sign consistent and when it is not. Our results are comparable to the
results for the standard model; when a measure of the signal to noise ratio is 
large, the Lasso is sign consistent. This is the first study of the sign consistency 
of the standard Lasso in a heteroscedastic setting. 

1.1 Overview of Previous Work 

The Lasso (Tibshirani, 1996) has been a popular technique to simultaneously se- 
lect a model and provide regularized estimated coefficients. There is a substantial 
literature on the use of the Lasso for sparsity recovery and subset selection under 
the standard homoscedastic linear model. This subsection gives only a very brief 
overview. 

In the noiseless setting (when $\epsilon = 0$), with contributions from a broad range of
researchers (Chen et al., 1998; Donoho and Huo, 2001; Candes and Tao, 2004;
Elad and Bruckstein, 2002, 2003; Tropp, 2004), there is now much understanding
of sufficient conditions on deterministic predictors $\{x_i, i = 1, \ldots, n\}$ and sparsity
index $S = \{j : \beta^*_j \ne 0\}$ for which the true $\beta^*$ can be recovered exactly. Results
by Donoho (2004), as well as Candes and Tao (2005), provide high probability
results for random ensembles of $X$.

There is also a substantial body of work focusing on the noisy setting (where
$\epsilon$ is random noise). Knight and Fu (2000) analyze the asymptotic behavior of the
optimal solution for fixed dimension ($p$), not only for $L_1$ regularization but also
for $L_r$ regularization with $r \in (0, 2]$. Both Tropp (2006) and Donoho et al. (2006)
provide sufficient conditions for the support of the optimal solution to the Lasso
problem (2) to be contained within the support of $\beta^*$. Recent work on the use of
the Lasso for model selection by Meinshausen and Bühlmann (2006) focuses on
Gaussian graphical models. Zhao and Yu (2006) consider linear regression and
more general noise distributions. For the case of Gaussian noise and Gaussian
predictors, both papers established that under particular mutual incoherence
conditions and the appropriate choice of the regularization parameter $\lambda$, the Lasso
can recover the sparsity pattern with probability converging to one for particular
regimes of $n$, $p$ and $q$. Zhao and Yu (2006) used a particular mutual incoherence
condition, the Irrepresentable Condition, which they show is almost necessary
when $p$ is fixed. The Irrepresentable Condition was found in Fuchs (2005) and
Zou (2006) as well. For i.i.d. Gaussian or sub-Gaussian noise, Wainwright (2009)
established a sharp relation between the problem dimension $p$, the number $q$ of
nonzero elements in $\beta^*$, and the number of observations $n$ that are required for
sign consistency.

1.2 The Contributions in this Paper 

Before giving the contributions of this paper, some definitions are needed. Define

$$\mathrm{sign}(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{if } x = 0, \\ -1 & \text{if } x < 0. \end{cases}$$

Define $=_s$ such that $\hat\beta(\lambda) =_s \beta^*$ if and only if $\mathrm{sign}(\hat\beta(\lambda)) = \mathrm{sign}(\beta^*)$ elementwise.

Definition 1. The Lasso is sign consistent if there exists a sequence $\lambda_n$ such
that

$$P\left[\hat\beta(\lambda_n) =_s \beta^*\right] \to 1 \quad \text{as } n \to \infty.$$

This paper studies the sign consistency of the Lasso applied to data from the
sparse Poisson-like model, giving non-asymptotic results for both the deterministic
design and the Gaussian random design. The non-asymptotic results give the
probability that $\hat\beta(\lambda) =_s \beta^*$, for any $\lambda$, $p$, $q$, and $n$. The sign consistency results
follow from the non-asymptotic results. This paper also gives necessary conditions
for the Lasso to be sign consistent under the sparse Poisson-like model. It
is shown that the Irrepresentable Condition is necessary for the Lasso's sign
consistency under this model. This condition is also necessary under the standard
model (Wainwright, 2009; Zhao and Yu, 2006; Zou, 2006). The sufficient conditions
for both the deterministic design and random Gaussian design require that
the variance of the noise is not too large and that the smallest nonzero element
of $\beta^*$ is not too small. Define the smallest nonzero element of $\beta^*$ as

$$M(\beta^*) = \min_{j \in S} |\beta^*_j|.$$
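The relation $=_s$ is simple to check numerically. The sketch below is one possible implementation; the function names and the optional tolerance `tol` (which treats tiny estimated coefficients as exact zeros, a convenience for iterative solvers) are assumptions of this example and are not part of the definition.

```python
def sign(x):
    """Return 1, 0, or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def same_signs(beta_hat, beta_star, tol=0.0):
    """True when sign(beta_hat_j) == sign(beta_star_j) for every j.
    Entries of beta_hat with |value| <= tol are treated as exact zeros."""
    if len(beta_hat) != len(beta_star):
        raise ValueError("coefficient vectors must have the same length")
    return all(
        sign(0.0 if abs(b) <= tol else b) == sign(s)
        for b, s in zip(beta_hat, beta_star)
    )


# Magnitudes may differ; only the sign pattern (including zeros) must agree.
ok = same_signs([2.0, 0.0, -0.1], [5.0, 0.0, -3.0])
```

Note that a single spurious nonzero coefficient, however small, breaks sign equality, so sign consistency is a stricter requirement than recovering the signs of the relevant predictors alone.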

For deterministic design, assume that

$$\Lambda_{\min}\left(\frac{1}{n} X(S)^T X(S)\right) \ge C_{\min} > 0,$$

where $\Lambda_{\min}(\cdot)$ denotes the minimal eigenvalue of a matrix and $C_{\min}$ is some
positive constant; for random Gaussian design, assume that

$$\Lambda_{\min}(\Sigma_{11}) \ge C_{\min} > 0 \quad \text{and} \quad \Lambda_{\max}(\Sigma) \le C_{\max} < \infty,$$

where $\Sigma_{11} \in \mathbb{R}^{q \times q}$ is the variance-covariance matrix of the true predictors,
$\Sigma \in \mathbb{R}^{p \times p}$ is the variance-covariance matrix of all predictors, $\Lambda_{\max}(\cdot)$ denotes the
maximal eigenvalue of a matrix, and $C_{\min}$ and $C_{\max}$ are some positive constants.
An essential quantity for determining the probability of sign recovery is the signal
to noise ratio

$$\mathrm{SNR} = \frac{n[M(\beta^*)]^2}{\sigma^2 \|\beta^*\|_2}. \qquad (4)$$

The numerator corresponds to the signal strength for sign recovery. The most
difficult sign to estimate in $\beta^*$ is the element that corresponds to $M(\beta^*)$. When
the smallest nonzero element is larger, estimating the signs is easier, and the
signal is more powerful. The noise term in the denominator contains $\|\beta^*\|_2$ because in
the heteroscedastic model considered in this paper, $\|\beta^*\|_2$ is fundamental in the
scaling of the noise. Because this paper is addressing the sign consistency of the
Lasso, the definition of the signal in SNR does not correspond to the typical
definition.
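A short numerical sketch of definition (4) may help. The function names and the toy coefficient vector are assumptions of this example. It also shows a consequence of $\beta^*$ appearing in both numerator and denominator: multiplying $\beta^*$ by a constant $c > 0$ multiplies SNR by $c$, not by $c^2$.

```python
import math


def smallest_nonzero(beta_star):
    """M(beta*): the smallest absolute value among the nonzero coefficients."""
    return min(abs(b) for b in beta_star if b != 0.0)


def snr(beta_star, n, sigma2):
    """SNR = n * M(beta*)^2 / (sigma^2 * ||beta*||_2), as in (4)."""
    l2_norm = math.sqrt(sum(b * b for b in beta_star))
    return n * smallest_nonzero(beta_star) ** 2 / (sigma2 * l2_norm)


beta = [3.0, -4.0, 0.0, 0.0]  # toy vector with ||beta||_2 = 5 and M(beta) = 3
base = snr(beta, n=100, sigma2=1.0)
doubled = snr([2.0 * b for b in beta], n=100, sigma2=1.0)  # equals 2 * base
```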

When SNR is large, the Lasso is sign consistent. Specifically, the sufficient
condition for deterministic design requires that

$$\mathrm{SNR} = \Omega\left(q \log(p+1) \max_i \|x_i(S)\|_2\right),$$

where $a_n = \Omega(b_n)$ means that $a_n$ grows faster than $b_n$, that is, $a_n/b_n \to \infty$. The
sufficient conditions for random Gaussian design require that

$$\mathrm{SNR} > 8 \log n \sqrt{2\max(4q, \log n)} \,/\, C_{\min}$$

and

$$\mathrm{SNR} = \Omega\left(q \log(p - q + 1)\right).$$

The previous three conditions all require that the signal to noise ratio is large.
This essential result for the Poisson-like model is identical to the corresponding
result for the standard model: both require that the variance of the noise is
small compared to the size of the signal. The simulations in Section 4 support
this essential result.

Several of the mathematical techniques in this paper come from Wainwright
(2009). However, our proofs are more involved because the errors in the
Poisson-like model are heteroscedastic and dependent on the design matrix. The
results in this paper require bounding

$$\frac{\epsilon^T \epsilon}{n^2} \quad \text{and} \quad X(S)^T \epsilon.$$

When the errors are Gaussian and homoscedastic, the first quantity is a scaled
$\chi^2$ variable. When the errors are heteroscedastic, the distribution becomes more
complicated. This paper calculates the second moment of $\epsilon_i^2$ and bounds $\epsilon^T\epsilon/n^2$ via
the Chebyshev inequality. Similarly for the second quantity, when the errors
depend on the design matrix, the variance of $X(S)^T\epsilon$ is more difficult to
bound. In this paper, the variance of $X(S)^T\epsilon$ is bounded using

$$P\left[\max_i \|x_i(S)\|_2^2 \ge 8 C_{\max} \max(4q, \log n)\right] \le \frac{1}{n},$$

where $x_i(S)$ is the $i$th row of $X(S)$. This large deviation result regarding the $\chi^2$
distribution is given in Appendix 3.
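As a rough sketch of the kind of Chebyshev argument described above (not the constants or the exact statement used in the appendix), write $v_i = \sigma^2 |x_i^T\beta^*|$ for the conditional variance of $\epsilon_i$. The symbol $v_i$ and the threshold $t$ are introduced here only for illustration. Standard Gaussian moment identities and the conditional independence of the $\epsilon_i$ give:

```latex
\begin{align*}
E[\epsilon_i^2 \mid X] &= v_i, \qquad
\mathrm{Var}(\epsilon_i^2 \mid X) = E[\epsilon_i^4 \mid X] - v_i^2 = 3v_i^2 - v_i^2 = 2v_i^2, \\
E\!\left[\frac{\epsilon^T\epsilon}{n^2} \,\Big|\, X\right] &= \frac{1}{n^2}\sum_{i=1}^n v_i, \qquad
\mathrm{Var}\!\left(\frac{\epsilon^T\epsilon}{n^2} \,\Big|\, X\right) = \frac{2}{n^4}\sum_{i=1}^n v_i^2, \\
P\!\left[\,\left|\frac{\epsilon^T\epsilon}{n^2} - \frac{1}{n^2}\sum_{i=1}^n v_i\right| \ge t \,\Big|\, X\right]
&\le \frac{2\sum_{i=1}^n v_i^2}{n^4 t^2} \quad \text{for any } t > 0.
\end{align*}
```

In the homoscedastic case all $v_i$ equal $\sigma^2$ and the exact $\chi^2$ distribution is available, so no such moment calculation is needed.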

The remainder of the paper is organized as follows. Section 2 analyzes the
Lasso estimator when the design matrix is deterministic. Section 3 considers
the case where the rows of $X$ are i.i.d. Gaussian vectors. Both sections give
(1) sufficient conditions for the Lasso to be sign consistent and (2) necessary
conditions for the Lasso's sign consistency. In Section 4, simulations demonstrate
the fundamental role of SNR and show that the Lasso performs similarly on both
homoscedastic and Poisson-like data. Section 5 gives some concluding thoughts.
Proofs are presented in the Appendix.



2 Deterministic Design 



This section examines when the Lasso is sign consistent and when it is not sign
consistent under the sparse Poisson-like model for a nonrandom design matrix
$X$. First, some notation:

$$x_i(S) = e_i^T X(S),$$

where $e_i$ is the unit vector with $i$th element one and the rest zero. Because
$S = \{j : \beta^*_j \ne 0\}$ is the sparsity index set, $x_i(S)$ is a row vector of dimension $q$.
Define

$$\beta^*(S) = (\beta^*_j)_{j \in S} \quad \text{and} \quad \vec{b} = \mathrm{sign}(\beta^*(S)).$$

Suppose the Irrepresentable Condition holds. That is, for some constant $\eta \in (0, 1]$,

$$\left\| X(S^c)^T X(S) \left(X(S)^T X(S)\right)^{-1} \vec{b} \right\|_\infty \le 1 - \eta. \qquad (5)$$

The $\ell_\infty$ norm of a vector, $\|\cdot\|_\infty$, is defined as the vector's largest element in
absolute value. In addition, assume that

$$\Lambda_{\min}\left(\frac{1}{n} X(S)^T X(S)\right) \ge C_{\min} > 0, \qquad (6)$$

where $\Lambda_{\min}$ denotes the minimal eigenvalue and $C_{\min}$ is some positive constant.
Condition (6) guarantees that the matrix $X(S)^T X(S)$ is invertible. These conditions
are also needed in Wainwright (2009) for sign consistency of the Lasso under the
standard model. Define

$$\Psi(X, \beta^*, \lambda) = \lambda \left[ \eta\, C_{\min}^{-1/2} + \left\| \left(\frac{1}{n} X(S)^T X(S)\right)^{-1} \vec{b} \right\|_\infty \right],$$

with which:



Theorem 1. Suppose that data $(X, Y)$ follows the sparse Poisson-like model
described by Equations (3) and each column of $X$ is normalized to $\ell_2$-norm $\sqrt{n}$.
Assume that (5) and (6) hold. If $\lambda$ satisfies

$$M(\beta^*) > \Psi(X, \beta^*, \lambda),$$

then with probability greater than

$$1 - 2\exp\left\{ -\frac{n\lambda^2\eta^2}{2\sigma^2 \|\beta^*\|_2 \max_i \|x_i(S)\|_2} + \log(p) \right\}$$

the Lasso has a unique solution $\hat\beta(\lambda)$ with $\hat\beta(\lambda) =_s \beta^*$.

Theorem 1 can be thought of as a straightforward consequence of Theorem 1 in
Wainwright (2009). In Wainwright (2009), sign consistency of the Lasso estimate
is given for a standard model with sub-Gaussian noise with parameter $\sigma^2$. In the
Poisson-like model, since $\mathrm{var}(\epsilon_i \mid x_i) = \sigma^2 |x_i^T\beta^*| \le \sigma^2 \max_i \|x_i(S)\|_2 \|\beta^*\|_2$ by the
Cauchy-Schwarz inequality, the noise can be thought of as sub-Gaussian variables
with parameter $\sigma^2 \max_i \|x_i(S)\|_2 \|\beta^*\|_2$.
To make this paper self-contained, a proof of Theorem 1 is given in Appendix
1.1.
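The variance bound in the preceding paragraph can be checked numerically. The sketch below computes the per-observation noise variances and the uniform bound $\sigma^2 \max_i \|x_i(S)\|_2 \|\beta^*\|_2$; the function names and the toy numbers are assumptions made for this illustration.

```python
import math


def noise_variances(X, beta_star, sigma2):
    """Conditional variances var(eps_i | x_i) = sigma2 * |x_i^T beta*|."""
    return [sigma2 * abs(sum(x * b for x, b in zip(row, beta_star))) for row in X]


def variance_bound(X, beta_star, sigma2):
    """Uniform bound sigma2 * max_i ||x_i(S)||_2 * ||beta*||_2 (Cauchy-Schwarz)."""
    support = [j for j, b in enumerate(beta_star) if b != 0.0]
    max_row_norm = max(math.sqrt(sum(row[j] ** 2 for j in support)) for row in X)
    beta_norm = math.sqrt(sum(b * b for b in beta_star))
    return sigma2 * max_row_norm * beta_norm


X = [[3.0, 4.0, 7.0], [1.0, 0.0, 2.0]]  # toy design; the third column is irrelevant
beta_star = [1.0, 2.0, 0.0]
variances = noise_variances(X, beta_star, sigma2=2.0)
bound = variance_bound(X, beta_star, sigma2=2.0)
```

Every entry of `variances` is at most `bound`, and only the columns in $S$ enter the bound because $x_i^T\beta^* = x_i(S)\beta^*(S)$.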

Theorem 1 gives a non-asymptotic result on the Lasso's sparsity pattern
recovery property. The next corollary specifies a sequence of $\lambda$'s that can
asymptotically recover the true sparsity pattern. The essential requirements are that

$$(1)\ \frac{n\lambda^2}{\max_i \|x_i(S)\|_2 \, \|\beta^*\|_2 \log(p+1)} \to \infty \quad \text{and} \quad (2)\ M(\beta^*) > \Psi(X, \beta^*, \lambda).$$

Define

$$\Gamma(X, \beta^*, \sigma^2) = \frac{\eta^2\, \mathrm{SNR}}{8 \max_i \|x_i(S)\|_2 \left(\eta\, C_{\min}^{-1/2} + \sqrt{q}\, C_{\min}^{-1}\right)^2 \log(p+1)}.$$

Corollary 1. As in Theorem 1, suppose that data $(X, Y)$ follows the sparse
Poisson-like model described by Equations (3) and each column of $X$ is normalized
to $\ell_2$-norm $\sqrt{n}$. Assume that (5) and (6) hold. Take $\lambda$ such that

$$\lambda = \frac{M(\beta^*)}{2\left(\eta\, C_{\min}^{-1/2} + \sqrt{q}\, C_{\min}^{-1}\right)}, \qquad (7)$$

then $\hat\beta(\lambda) =_s \beta^*$ with probability greater than

$$1 - 2\exp\left\{ -\left(\Gamma(X, \beta^*, \sigma^2) - 1\right) \log(p+1) \right\}.$$

If $\Gamma(X, \beta^*, \sigma^2) \to \infty$, then $P[\hat\beta(\lambda) =_s \beta^*]$ converges to one.

A proof of this corollary can be found in Appendix 1.2.

This corollary gives a class of heteroscedastic models for which the Lasso
gives a sign consistent estimate of $\beta^*$. This class requires that $\Gamma(X, \beta^*, \sigma^2) \to \infty$,
which means that

$$\mathrm{SNR} = \frac{n[M(\beta^*)]^2}{\sigma^2\|\beta^*\|_2} = \Omega\left(q \log(p+1) \max_i \|x_i(S)\|_2\right), \qquad (8)$$

where $a_n = \Omega(b_n)$ means that $a_n$ grows faster than $b_n$, that is, $a_n/b_n \to \infty$. In
other words, this condition requires that SNR grows fast enough.

For a moment, suppose that the errors are homoscedastic and $\mathrm{var}(\epsilon_i) = \sigma^2$.
The exact same theorem could be proven for this homoscedastic model by
replacing $\sigma^2\|\beta^*\|_2$ with $\sigma^2$ and removing $\max_i \|x_i(S)\|_2$. This shows that when a
measure of the signal to noise ratio is large, the Lasso can select the true model
in both the case of homoscedastic noise and Poisson-like noise.

The next corollary addresses the classical setting, where $p$, $q$, and $\beta^*$ are all
fixed and $n$ goes to infinity. While this is a straightforward result from Corollary
1, it removes some of the complexities and leads to good intuition. Since $M(\beta^*)$
and $\|\beta^*\|_2$ do not change with $n$, $\Gamma(X, \beta^*, \sigma^2) \to \infty$ in Corollary 1 when
$\frac{1}{n}\max_{1 \le i \le n} \|x_i(S)\|_2 \to 0$. Then:

Corollary 2. As in Theorem 1, suppose that data $(X, Y)$ follows the sparse
Poisson-like model described by Equations (3) and each column of $X$ is normalized
to $\ell_2$-norm $\sqrt{n}$. Assume that (5) and (6) hold. In the classical case when
$p$, $q$ and $\beta^*$ are fixed, if

$$\frac{1}{n} \max_{1 \le i \le n} \|x_i(S)\|_2 \to 0, \qquad (9)$$

then by choosing $\lambda$ as in equation (7),

$$P\left[\hat\beta(\lambda) =_s \beta^*\right] \to 1 \quad \text{as } n \to \infty.$$

Condition (9) is not strong and is easily satisfied. Suppose

$$0 < \Lambda_{\max}\left(\frac{1}{n} X(S)^T X(S)\right) \le C_{\max} < \infty,$$

where $\Lambda_{\max}(\cdot)$ is the maximum eigenvalue of a matrix and $C_{\max}$ is a positive
constant; then

$$\left\| \frac{1}{\sqrt{n}} x_i(S) \right\|_2 = \left\| \frac{1}{\sqrt{n}} e_i^T X(S) \right\|_2 \le \Lambda_{\max}^{1/2}\left(\frac{1}{n} X(S)^T X(S)\right) \le C_{\max}^{1/2}.$$

Consequently,

$$\frac{1}{n} \max_{1 \le i \le n} \|x_i(S)\|_2 = \frac{1}{\sqrt{n}} \max_{1 \le i \le n} \left\| \frac{1}{\sqrt{n}} x_i(S) \right\|_2 \le \frac{C_{\max}^{1/2}}{\sqrt{n}} \to 0.$$

Corollary 2 states that in the classical settings, the Lasso can consistently select
the true model under the sparse Poisson-like model.

So far the results have given sufficient conditions for sign consistency of the
Lasso. To further understand how the sign consistency of the Lasso might be
sensitive to the heteroscedastic model, the next theorem gives necessary conditions
on the ratio of $\beta^*$ to the noise level.

Theorem 2 (Necessary Conditions). Suppose that data $(X, Y)$ follows the sparse
Poisson-like model described by Equations (3) and each column of $X$ is normalized
to $\ell_2$-norm $\sqrt{n}$. Assume that (6) holds.

(a) Consider $\frac{1}{n} X(S)^T X(S) = I_{q \times q}$. For any $j$, define

$$c_{nj} = \frac{n^2 \beta_j^{*2}}{\sigma^2 \left[ X(S)^T \mathrm{diag}(|X\beta^*|) X(S) \right]_{jj}}. \qquad (10)$$

Define $c_n = \min_j c_{nj}$. Then, for sign consistency, it is necessary that $c_n \to \infty$.
Specifically,



P 



/3(A) = s P* 



< 



1 expj-c^/2} 



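(b) If the Irrepresentable Condition (5) does not hold, specifically,

$$\left\| X(S^c)^T X(S) \left(X(S)^T X(S)\right)^{-1} \vec{b} \right\|_\infty \ge 1, \qquad (11)$$

then the Lasso estimate is not sign consistent: $P\left[\hat\beta(\lambda) =_s \beta^*\right] \le \frac{1}{2}$.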



A proof of Theorem 2 can be found in Appendix 1.3. 
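Since the Irrepresentable Condition appears in both the sufficient and the necessary conditions, it can be useful to evaluate the left-hand side of (5) and (11) for a given design. The sketch below does so with a small Gaussian-elimination solver. The function names are hypothetical, the sign vector is taken to be $\mathrm{sign}(\beta^*(S))$, and the code assumes $0 < q < p$ and an invertible $X(S)^T X(S)$.

```python
def solve(A, rhs):
    """Solve A w = rhs by Gaussian elimination with partial pivoting."""
    q = len(A)
    M = [list(A[i]) + [rhs[i]] for i in range(q)]  # augmented matrix
    for c in range(q):
        piv = max(range(c, q), key=lambda r: abs(M[r][c]))
        if M[piv][c] == 0.0:
            raise ValueError("X(S)^T X(S) is singular")
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, q):
            f = M[r][c] / M[c][c]
            for k in range(c, q + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * q
    for r in reversed(range(q)):
        w[r] = (M[r][q] - sum(M[r][k] * w[k] for k in range(r + 1, q))) / M[r][r]
    return w


def irrepresentable_value(X, beta_star):
    """|| X(S^c)^T X(S) (X(S)^T X(S))^{-1} sign(beta*(S)) ||_infinity."""
    n, p = len(X), len(X[0])
    S = [j for j in range(p) if beta_star[j] != 0.0]
    Sc = [j for j in range(p) if beta_star[j] == 0.0]
    b = [1.0 if beta_star[j] > 0 else -1.0 for j in S]
    gram = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in S] for j in S]
    w = solve(gram, b)  # (X(S)^T X(S))^{-1} b
    v = [sum(X[i][S[a]] * w[a] for a in range(len(S))) for i in range(n)]  # X(S) w
    return max(abs(sum(X[i][j] * v[i] for i in range(n))) for j in Sc)


# An irrelevant column that duplicates a relevant one violates the condition.
value = irrepresentable_value([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], [2.0, 3.0, 0.0])
```

Condition (5) holds for some $\eta \in (0, 1]$ exactly when the returned value is strictly below one; a value of one or more corresponds to case (b) of Theorem 2.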






JINZHU JIA, KARL ROHE AND BIN YU 



Statement (a) would hold for the homoscedastic model by removing $\mathrm{diag}(|X\beta^*|)$
from the denominator in Equation (10). Equation (10) can be viewed as a comparison
of the signal strength ($\beta_j^{*2}$) to the noise level ($\mathrm{var}(X_j^T\epsilon)$). Theorem 2
shows that the signal strength needs to be large relative to the noise level.

Statement (b) says that the Irrepresentable Condition (5) is necessary for the
Lasso's sign consistency. This necessary condition can also be found in both Zhao
and Yu (2006) and Wainwright (2009). Zhao and Yu (2006) point out that the
Irrepresentable Condition is almost necessary and sufficient for the Lasso to be
sign consistent under the standard homoscedastic model when $p$ and $q$ are fixed.
Wainwright (2009) shows that it is necessary for the Lasso's sign consistency under
the standard model for any $p$ and $q$.

The results in this section have addressed the case when $X$ is fixed. The
essential results say that when SNR is large, the Lasso performs well at estimating
$\mathrm{sign}(\beta^*)$. In the next section $X$ is random. The randomness of $X$ allows us to
study how the statistical dependence between $X$ and the noise terms affects the
sign consistency of the Lasso. When SNR is large, the Lasso is robust to this
violation of independence.

3 Gaussian Random Design

Consider the Gaussian random design where rows of $X$ are i.i.d. from a
$p$-dimensional multivariate Gaussian distribution with mean $0$ and variance-covariance
matrix $\Sigma$. Assume the diagonal entries of $\Sigma$ are all equal to one. Define the
variance-covariance matrix of the relevant predictors to be $\Sigma_{11}$ and the covariance
between the irrelevant predictors and the relevant predictors to be $\Sigma_{21}$.

Let $\Lambda_{\min}(\cdot)$ denote the minimum eigenvalue of a matrix and $\Lambda_{\max}(\cdot)$ denote the
maximum eigenvalue of a matrix. To get the main results that allow $p$ to grow
with $n$, the following regularity conditions are needed on the $p \times p$ matrix $\Sigma$.
First, for some positive constants $C_{\min}$ and $C_{\max}$ that do not depend on $n$,

$$\Lambda_{\min}(\Sigma_{11}) \ge C_{\min} > 0 \quad \text{and} \quad \Lambda_{\max}(\Sigma) \le C_{\max} < \infty, \qquad (12)$$

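and second, the Irrepresentable Condition,

$$\left\| \Sigma_{21} (\Sigma_{11})^{-1} \vec{b} \right\|_\infty \le 1 - \eta, \qquad (13)$$

for some constant $\eta \in (0, 1]$, where $\vec{b} = \mathrm{sign}(\beta^*(S))$ as in Section 2. Assumptions
(12) and (13) are standard assumptions in previous research under standard models.
Define,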

$$A(n, \beta^*, \sigma^2) = \sqrt{\frac{4\sigma^2 \|\beta^*\|_2 \log n \,\sqrt{2\max(16q, 4\log n)}}{n C_{\min}}} \quad \text{and}$$

*(n,/3*,A,a 2 ) = A(n, f3* , a 2 ) + 5^?. 

^min 

These quantities are used in the following theorem.

Theorem 3. Consider the sparse Poisson-like model described by (3), under
Gaussian random design. Suppose that the variance-covariance matrix $\Sigma$ satisfies
condition (12) and condition (13) with unit diagonal. Further, suppose that
$q/n \to 0$. Then for any $\lambda$ such that

$$M(\beta^*) > \Psi(n, \beta^*, \lambda, \sigma^2),$$

$\hat\beta(\lambda) =_s \beta^*$ with probability greater than

l-2exp(- ff +log(p-g)l-(2g+3)exp{-0.03n}- 1 ^ 3 g. 

I 2y*(n,/3*,A,cr 2 )C max J n 

A proof of Theorem 3 can be found in Appendix 1.4. 

Theorem 3 gives a non-asymptotic result on the Lasso's sparsity pattern
recovery property, from which the next corollary can be derived. It specifies a
sequence of $\lambda$'s that asymptotically recovers the true sparsity pattern on a
well-behaved class of models. This class of models restricts the relationship between
the data ($X$), the coefficients ($\beta^*$), and the distribution of the noise ($\epsilon$).
Essentially, $\lambda$ should be chosen such that

(1) — ( „f f 2 ^ log(p-<z)^oo and (2) M(/T) > *(n, /?*, A, a 2 ). 

2V*(n,(3*, A,^)C max 
Define

f(n,p*,a 2 ) 



nrj 2 



96^||/3*|| 2 log(p-g + l)VC mi 

Aq\og(p -q + l)C max /C min + 



[M(/3*) — A(n, /?*, cr 2 )] 2 Cij 



Corollary 3. As in Theorem 3, consider the sparse Poisson-like model described
by (3), under Gaussian random design. Suppose the variance-covariance matrix
$\Sigma$ satisfies condition (12) and condition (13) with unit diagonal. Further, suppose
that $M(\beta^*) > A(n, \beta^*, \sigma^2)$ and $q/n \to 0$. Take $\lambda$ such that

. _ [M^*)-A(n,/3*,a 2 )]C^ a 
then $\hat\beta(\lambda) =_s \beta^*$ with probability greater than

$$1 - 2\exp\left\{ -\log(p - q + 1)\left[\tilde\Gamma(n, \beta^*, \sigma^2) - 1\right] \right\} - (2q + 3)\exp\{-0.03n\} - \frac{1 + 3q}{n}.$$

If



/r , / ,m , «[Af(/9*) - A(n,/3*,a 2 )] 2 

n/ glog(p- 9 + l ^oo and 1 v y v ;J -» oo, 14 

then $P[\hat\beta(\lambda) =_s \beta^*]$ converges to one.

A proof of Corollary 3 can be found in Appendix 1.5. 

This corollary gives a class of heteroscedastic models for which the Lasso
gives a sign consistent estimate of $\beta^*$, when the predictors are from a Gaussian
random ensemble.

The sufficient conditions require that $M(\beta^*) > A(n, \beta^*, \sigma^2)$, which is
equivalent to

$$\mathrm{SNR} > 8\, C_{\min}^{-1} \log n \sqrt{2\max(4q, \log n)}. \qquad (15)$$

The sufficient conditions also require the conditions in (14), which imply that

$$\mathrm{SNR} = \Omega\left(q \log(p - q + 1)\right). \qquad (16)$$

These conditions show that when SNR is large, the Lasso can identify the signs
of the true coefficients. This result is similar to the result for the fixed design case
in the previous section and it is similar to results on the standard model.

The next theorem gives necessary conditions for the Lasso to be sign consistent.
It says that the Irrepresentable Condition is necessary for the sign consistency
of the Lasso under the sparse Poisson-like model. This condition is also
necessary under the homoscedastic model.

Theorem 4 (Necessary Conditions). Consider the sparse Poisson-like model
described by (3), under Gaussian random design. Suppose the variance-covariance
matrix $\Sigma$ satisfies condition (12). If the Irrepresentable Condition (13) does not
hold, specifically,

$$\left\| \Sigma_{21} (\Sigma_{11})^{-1} \vec{b} \right\|_\infty \ge 1, \qquad (17)$$

then the Lasso estimate is not sign consistent: $P\left[\hat\beta(\lambda) =_s \beta^*\right] \le \frac{1}{2}$.

A proof of Theorem 4 can be found in Appendix 1.6.

This section identified sufficient conditions for the Lasso to be sign consistent
when the design matrix is random and Gaussian. The sufficient conditions are
similar for both fixed and random design matrices. They are also similar for both
homoscedastic noise and Poisson-like noise. In all cases, the conditions require
that a measure of the signal to noise ratio is large; see Equations (8) and (16)
and the inequality in (15). In the next section, simulations are used to directly
compare the performance of the Lasso between the Poisson-like model and the
standard homoscedastic model.

4 Simulation Studies

There are two examples in this section. The first example investigates a
peculiarity of the SNR defined in (4): functions of $\beta^*$ appear in both the signal and
the noise. The second example compares the model selection performance of the
Lasso under the standard model to the model selection performance of the Lasso
under the sparse Poisson-like model. In the first example, all data is generated
from the sparse Poisson-like model. In the second example, the performance of
the Lasso is compared between the two models of the noise. The parameterizations
of the standard homoscedastic models differ in only one respect: the noise
terms are homoscedastic. To ensure a fair comparison, the variance of the noise
terms in the standard model is set equal to the average variance of the noise
terms in the corresponding Poisson-like model.

All simulations were done in R with the LARS package (Efron et al., 2004).

Example 1 This example studies how the Lasso is sensitive to the ratio of
signal to noise defined earlier. Recall,

$$\mathrm{SNR} = \frac{n[M(\beta^*)]^2}{\sigma^2 \|\beta^*\|_2}. \qquad (18)$$

The theoretical results in the previous sections suggest that when SNR is large,
the Lasso is sign consistent. In SNR, $\beta^*$ can affect the ratio in two ways. As
$\|\beta^*\|_2$ grows, so does the variance of the noise term. As $[M(\beta^*)]^2$ shrinks, so
does the signal. This first example investigates both effects, changing the value
of $\|\beta^*\|_2$ while keeping $M(\beta^*)$ fixed, and vice versa.

Consider an initial model with parameters $n = 400$, $p = 1000$,
$q = 20$, $\sigma^2 = 1$, and each element of the design matrix $X$ drawn independently
from $N(0, 1)$. Once $X$ is drawn, it is fixed through all of the simulations.
This $X$ is also used in Example 2. $\beta^*$ is designed this way:

$$\beta^*_j = \begin{cases} \beta_{\max} & \text{if } j \le 10, \\ \beta_{\min} & \text{if } 11 \le j \le 20, \\ 0 & \text{otherwise.} \end{cases}$$

Changing $\beta_{\min}$ or $\|\beta^*\|_2$ changes the value of SNR. To investigate the role of
these two quantities, there are two simulation designs. The first simulation design
fixes $M(\beta^*) = \beta_{\min} = 5$ and changes the value of $\beta_{\max}$. The second simulation
design fixes $\|\beta^*\|_2$ and changes the value of $M(\beta^*)$. There is one model that is
present in both designs. It sets $\beta_{\max} = 40$ and $\beta_{\min} = 5$. In this model that is
common to both designs, $\|\beta^*\|_2 \approx 127$ and $\mathrm{SNR} = 400 \times 5^2 / 127 \approx 78$.

The first simulation design has ten different parameterizations. It sets $\beta_{\min} = 5$
and chooses

$$\beta_{\max} \in \{100, 90, 80, 70, 60, 50, 40, 30, 20, 10\}.$$

Each of these ten different parameterizations creates a different value of SNR.
The second simulation design has ten different parameterizations, each fixing
$\|\beta^*\|_2$ and altering $\beta_{\min}$ such that SNR takes the same values as in the first
simulation design (to keep $\|\beta^*\|_2$ fixed, $\beta_{\max}$ must change accordingly). The values
of the parameters for the two designs are described in the following two tables.

Table 1: The design of the first simulation is described in this table. It shows the
relationship between $\|\beta^*\|_2$, $M(\beta^*)$, and SNR. $M(\beta^*) = 5$ is fixed. When $\beta_{\max}$ shrinks,
$\|\beta^*\|_2$ also shrinks, increasing SNR. The numbers in the table are rounded to the nearest
integer.

$\beta_{\max}$       100   90   80   70   60   50   40   30   20   10
$\|\beta^*\|_2$      317  285  253  222  190  159  127   96   65   35
SNR                   32   35   39   45   53   63   78  104  153  283
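The entries of Table 1 follow directly from the construction of $\beta^*$ (ten entries equal to $\beta_{\max}$, ten equal to $\beta_{\min}$) and definition (18). The sketch below recomputes them; the function name `design_one` and its default arguments are assumptions of this example, set to the values stated in the text ($\beta_{\min} = 5$, $n = 400$, $\sigma^2 = 1$).

```python
import math


def design_one(beta_max, beta_min=5.0, n=400, sigma2=1.0):
    """Return (||beta*||_2, SNR) when beta* has ten entries equal to beta_max,
    ten equal to beta_min, and zeros elsewhere."""
    l2_norm = math.sqrt(10 * beta_max ** 2 + 10 * beta_min ** 2)
    snr = n * beta_min ** 2 / (sigma2 * l2_norm)
    return l2_norm, snr


# One row per parameterization: (beta_max, ||beta*||_2, SNR), unrounded.
rows = [(bm,) + design_one(float(bm)) for bm in (100, 90, 80, 70, 60, 50, 40, 30, 20, 10)]
for beta_max, l2_norm, snr in rows:
    print(beta_max, round(l2_norm), round(snr))
```

Rounding each computed value to the nearest integer gives the table rows above.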



Table 2: The design of the second simulation is described in this table. It shows the
relationship between $M(\beta^*)$, $\beta_{\max}$, and SNR. $\beta_{\min}$ and $\beta_{\max}$ are chosen such that
$\|\beta^*\|_2 = 127$ is fixed and SNR keeps the same values as in simulation design one.
Thus, $\beta_{\min}$ and $\beta_{\max}$ are determined by SNR and $\|\beta^*\|_2$: $\beta_{\min} = \sqrt{\mathrm{SNR}\,\|\beta^*\|_2 / n}$ and
$\beta_{\max} = \sqrt{\|\beta^*\|_2^2 / 10 - \beta_{\min}^2}$. The numbers in the table are rounded.

$\beta_{\min} = M(\beta^*)$   3.2  3.3  3.6  3.8  4.1  4.5  5.0  5.8  7.0  9.5
$\beta_{\max}$                 40   40   40   40   40   40   40   40   40   39
SNR                            32   35   39   45   53   63   78  104  153  283
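The two formulas in the caption of Table 2 can be applied as follows. This is a sketch: the function name `design_two` is hypothetical, and the `sigma2` argument is a generalization added here (the caption's formula corresponds to $\sigma^2 = 1$). Feeding in the unrounded SNR values of design one should give numbers close to the rounded table entries.

```python
import math


def design_two(target_snr, l2_norm, n=400, sigma2=1.0):
    """Return (beta_min, beta_max) that reach target_snr while keeping
    ||beta*||_2 = l2_norm, for beta* with ten entries of each size."""
    beta_min = math.sqrt(target_snr * sigma2 * l2_norm / n)  # invert SNR = n M^2 / (sigma2 ||beta*||_2)
    beta_max = math.sqrt(l2_norm ** 2 / 10 - beta_min ** 2)  # from ||beta*||_2^2 = 10 beta_max^2 + 10 beta_min^2
    return beta_min, beta_max


# The model common to both designs: beta_max = 40, beta_min = 5.
common_norm = math.sqrt(10 * 40.0 ** 2 + 10 * 5.0 ** 2)
common_snr = 400 * 5.0 ** 2 / common_norm
beta_min, beta_max = design_two(common_snr, common_norm)  # recovers (5, 40) up to rounding error
```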



For each simulation design, the Monte Carlo estimate for the probability of
correctly estimating the signs is plotted against SNR in Figure 1. Each point
along the solid line in Figure 1 corresponds to simulation design one ($\beta_{\min}$ is
fixed), and each point along the dashed line corresponds to simulation design two
($\|\beta^*\|_2$ is fixed). Success is defined as the existence of a $\lambda$ that makes $\hat\beta(\lambda) =_s \beta^*$.
The probability of success for each point is estimated with 500 trials.

Figure 1 shows that as SNR increases, the probability of success also
increases. What is especially remarkable is the similarity between the solid and
dashed lines. This simulation demonstrates that increasing the elements of $\beta^*$ can
either increase or decrease the probability of successfully estimating the signs.
Further, this simulation demonstrates that these effects are well characterized by
SNR.

[Figure 1 (plot): "Probability of success increases as SNR increases"; horizontal axis: SNR, from 50 to 250.]

Figure 1: Probability of Success vs. SNR in Example 1. For the solid line, $\|\beta^*\|_2$
decreases while $M(\beta^*)$ is kept constant. For the dashed line, $M(\beta^*)$ increases while
$\|\beta^*\|_2$ is kept constant. The values of $M(\beta^*)$ and $\|\beta^*\|_2$ are chosen so that SNR as
defined in (18) takes the values specified on the horizontal axis. Each probability is
estimated with 500 simulations.

Example 2 (Comparison to Standard Model) This example compares
the performance of the Lasso applied to homoscedastic data to the performance
of the Lasso applied to Poisson-like data. The design matrix X is exactly the
same (fixed) matrix as in Example 1 and $\beta^*$ follows the constructions specified in
Tables 1 and 2. The only difference between Example 1 and Example 2 is that
the noise terms are homoscedastic in Example 2. To ensure a fair comparison,
the variance of the noise is always set equal to the average variance of the noise
terms in the corresponding Poisson-like model.
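
The variance matching described in Example 2 can be sketched in a few lines. The example is illustrative only: the helper names, the toy values of $x_i^T\beta^*$, and $\sigma^2 = 1$ are assumptions, and the noise is drawn as Gaussian with variance $\sigma^2 |x_i^T\beta^*|$, following the Poisson-like model.

```python
import math
import random


def average_variance(means, sigma2):
    """Average of the Poisson-like variances sigma2 * |x_i^T beta*|."""
    return sigma2 * sum(abs(m) for m in means) / len(means)


def poisson_like_noise(means, sigma2, rng):
    """Draw eps_i ~ N(0, sigma2 * |mean_i|), one draw per observation."""
    return [rng.gauss(0.0, math.sqrt(sigma2 * abs(m))) for m in means]


def matched_homoscedastic_noise(means, sigma2, rng):
    """Draw eps_i ~ N(0, v), where v is the average Poisson-like variance."""
    sd = math.sqrt(average_variance(means, sigma2))
    return [rng.gauss(0.0, sd) for _ in means]


if __name__ == "__main__":
    rng = random.Random(0)
    means = [1.0, 4.0, 9.0, 16.0]   # toy values of x_i^T beta* (made up)
    print(average_variance(means, 1.0))
    print(poisson_like_noise(means, 1.0, rng))
    print(matched_homoscedastic_noise(means, 1.0, rng))
```

Both noise vectors then have the same average variance, so any difference in Lasso performance reflects how the variance is distributed across observations rather than its overall level.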

There are four lines drawn in Figure 2. The solid lines correspond to
simulation design one ($\|\beta^*\|_2$ grows while $M(\beta^*)$ is held constant). The dashed
lines correspond to simulation design two ($M(\beta^*)$ shrinks while $\|\beta^*\|_2$ is held
constant). The bold lines correspond to the simulations on homoscedastic data.
The thinner lines are identical to the lines in Figure 1. They are included to
compare the performance of the Lasso on homoscedastic data to the performance of
the Lasso on Poisson-like data.

Exactly as in Example 1, success is defined as the existence of a $\lambda$ which
makes $\hat\beta(\lambda) =_s \beta^*$. The probability of success for each point is estimated with
500 trials.

[Figure 2 (plot): "Performance on homoscedastic data is similar to performance on Poisson-like data"; horizontal axis: SNR, from 50 to 250.]

Figure 2: The bold solid line nearly covers the thinner solid line. This demonstrates 
how similar the results are for the homoscedastic data and the Poisson-like data. The 
same statement holds for the dashed lines. 

In the Poisson-like model, the variance of the noise term depends on the
predicting variables. In the standard model, the noise term is homoscedastic and
independent of the predicting variables. Yet, Figure 2 demonstrates how similar
the performance of the Lasso is under these two different models of the noise.
In the figure, the thinner lines are nearly indistinguishable from the bold lines,
showing that the Lasso is robust to one type of heteroscedastic noise.

5 Conclusion 

There is a significant amount of research dedicated to understanding how the 
Lasso (and similar methods) perform under the standard homoscedastic model. 
However, in practice, data does not necessarily have the features described by 
the standard model. This paper aims to understand if the sign consistency of the 
Lasso is robust to the heteroscedastic errors in the Poisson-like model. This model 
is motivated by certain problems in high-dimensional medical imaging (Fessler, 
2000). The key feature of the model is that, for each observation, the variance
of the noise term is proportional to the absolute value of the expectation of the
observation ($\mathrm{var}(\epsilon_i) \propto |E(Y_i)|$). In the Poisson-like model, like all heteroscedastic
models, the noise term is not independent of the design matrix.

In this paper, we analyzed the sign consistency of the standard Lasso when
the data come from the sparse Poisson-like model, showing that the Lasso is
robust to one type of violation of the assumption of homoscedasticity. Theorems 1
and 3 give non-asymptotic results for the Lasso's sign consistency property under
the sparse Poisson-like model for both deterministic design matrices and random
Gaussian ensemble design matrices. Following these non-asymptotic results, a
suitable $\lambda$ was chosen such that, under sufficient conditions, the Lasso is sign
consistent. We also studied how sensitive the Lasso is to the heteroscedastic model
by finding necessary conditions for sign consistency. The theoretical results for
the sparse Poisson-like model are similar to results for the standard model when
the SNRs are matched. In both models, for the Lasso to be sign consistent, it is
essential that a measure of the signal to noise ratio is large. The simulations
demonstrated what our theory predicted: the Lasso performs similarly in terms
of model selection performance on both Poisson-like data and homoscedastic data
when the variance of the noise is scaled appropriately. These simulations were run
across multiple choices of $\beta^*$ and multiple choices of the noise level. Taken as a
whole, these results suggest that the Lasso is robust to Poisson-like heteroscedastic
noise.
Appendix

1 Proofs 

1.1 Proof of Theorem 1 

To prove the theorem, we need the next lemma, which gives necessary and
sufficient conditions for the Lasso's sign consistency. These conditions are important to the
asymptotic analysis. Wainwright (2009) gives these conditions, which follow from the
KKT conditions.

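Lemma 1. For the linear model $Y = X\beta^* + \epsilon$, assume that the matrix $X(S)^T X(S)$
is invertible. Then for any given $\lambda > 0$ and any noise term $\epsilon \in \mathbb{R}^n$, there exists
a Lasso estimate $\hat\beta(\lambda)$ which satisfies $\hat\beta(\lambda) =_s \beta^*$ if and only if the following two
conditions hold:

$$\left| X(S^c)^T X(S)\left(X(S)^T X(S)\right)^{-1}\left[\frac{1}{n}X(S)^T\epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\right] - \frac{1}{n}X(S^c)^T\epsilon \right| \le \lambda, \tag{19}$$

$$\mathrm{sign}\left(\beta^*(S) + \left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\left[\frac{1}{n}X(S)^T\epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\right]\right) = \mathrm{sign}(\beta^*(S)), \tag{20}$$

where the vector inequality and equality are taken elementwise. Moreover, if (19)
holds strictly, then

$$\hat\beta = (\hat\beta^{(1)}, 0)$$

is the unique optimal solution to the Lasso problem (2), where

$$\hat\beta^{(1)} = \beta^*(S) + \left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\left[\frac{1}{n}X(S)^T\epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\right]. \tag{21}$$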
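
Conditions (19) and (20) are straightforward to evaluate numerically for a given design, noise vector, and $\lambda$. The following sketch assumes NumPy is available; the function name and the toy inputs used to exercise it are hypothetical. It evaluates the two conditions as stated in Lemma 1 and does not solve the Lasso problem itself.

```python
import numpy as np


def lemma1_conditions(X, beta_star, eps, lam):
    """Return (cond19, cond20), the truth values of (19) and (20)."""
    n = X.shape[0]
    S = np.flatnonzero(beta_star)             # support of beta*
    Sc = np.flatnonzero(beta_star == 0)       # complement of the support
    XS, XSc = X[:, S], X[:, Sc]
    b = np.sign(beta_star[S])
    # Shared vector: (1/n) X(S)^T eps - lam * sign(beta*(S)).
    w = XS.T @ eps / n - lam * b
    # Condition (19): elementwise bound on the inactive coordinates.
    lhs19 = XSc.T @ XS @ np.linalg.solve(XS.T @ XS, w) - XSc.T @ eps / n
    cond19 = bool(np.all(np.abs(lhs19) <= lam))
    # Condition (20): the candidate active coefficients in (21) keep their signs.
    beta1 = beta_star[S] + np.linalg.solve(XS.T @ XS / n, w)
    cond20 = bool(np.all(np.sign(beta1) == b))
    return cond19, cond20
```

For instance, with an orthogonal design satisfying $\frac{1}{n}X^TX = I$ and no noise, (19) holds for every $\lambda > 0$, while (20) fails as soon as $\lambda$ exceeds the smallest nonzero $|\beta^*_j|$.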
As in Wainwright (2009), we state sufficient conditions for (19) and (20).
Define

$$\vec{b} = \mathrm{sign}(\beta^*(S)),$$

and denote by $e_i$ the vector with 1 in the ith position and zeroes elsewhere.
Define

$$U_i = e_i^T\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\left[\frac{1}{n}X(S)^T\epsilon - \lambda\vec{b}\right],$$

$$V_j = X_j^T\left\{X(S)\left(X(S)^T X(S)\right)^{-1}\lambda\vec{b} - \left[X(S)\left(X(S)^T X(S)\right)^{-1}X(S)^T - I\right]\frac{\epsilon}{n}\right\}.$$

By rearranging terms, it is easy to see that (19) holds strictly if and only if

$$\mathcal{M}(V) = \left\{\max_{j \in S^c} |V_j| < \lambda\right\} \tag{22}$$

holds. If we define $M(\beta^*) = \min_{j \in S} |\beta^*_j|$ (recall that $S = \{j : \beta^*_j \ne 0\}$ is the
sparsity index), then the event

$$\mathcal{M}(U) = \left\{\max_{i \in S} |U_i| < M(\beta^*)\right\} \tag{23}$$

is sufficient to guarantee that condition (20) holds. Finally, a proof of Theorem
1.



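Proof. This proof is divided into two parts. First we analyze the
probability of the event $\mathcal{M}(V)$, and then we analyze the event $\mathcal{M}(U)$.

Analysis of $\mathcal{M}(V)$: Note from (22) that $\mathcal{M}(V)$ holds if and only if
$\max_{j \in S^c} |V_j| / \lambda < 1$. Each random variable $V_j$ is Gaussian with mean

$$\mu_j = \lambda X_j^T X(S)\left(X(S)^T X(S)\right)^{-1}\vec{b}.$$

Define $\tilde V_j = X_j^T\left[I - X(S)\left(X(S)^T X(S)\right)^{-1}X(S)^T\right]\frac{\epsilon}{n}$, then $V_j = \mu_j + \tilde V_j$.
Using condition (5), we have $|\mu_j| \le (1 - \eta)\lambda$ for all $j \in S^c$, from which we obtain
that

$$\frac{1}{\lambda}\max_{j \in S^c} |\tilde V_j| < \eta \;\Rightarrow\; \frac{1}{\lambda}\max_{j \in S^c} |V_j| < 1.$$

By the Gaussian comparison result (34) stated in Lemma 5, we have

$$P\left[\frac{1}{\lambda}\max_{j \in S^c} |\tilde V_j| \ge \eta\right] \le 2(p - q)\exp\left\{-\frac{\lambda^2\eta^2}{2\max_{j \in S^c} E(\tilde V_j^2)}\right\}.$$

Here

$$E(\tilde V_j^2) = \frac{1}{n^2}X_j^T H[\mathrm{VAR}(\epsilon)]H X_j,$$

where $H = I - X(S)\left(X(S)^T X(S)\right)^{-1}X(S)^T$ has maximum eigenvalue
equal to 1, and $\mathrm{VAR}(\epsilon)$ is the variance-covariance matrix of $\epsilon$, which is a diagonal
matrix with the ith diagonal element equal to $\sigma^2 \times |x_i^T\beta^*|$.

Since $|x_i^T\beta^*| \le \sqrt{\|x_i(S)\|_2^2\,\|\beta^*\|_2^2} \le \max_i \|x_i(S)\|_2\,\|\beta^*\|_2$, an operator bound
yields

$$E(\tilde V_j^2) \le \frac{\sigma^2}{n^2}\max_i \|x_i(S)\|_2\,\|\beta^*\|_2\,\|X_j\|_2^2 = \frac{\sigma^2}{n}\max_i \|x_i(S)\|_2\,\|\beta^*\|_2.$$

Therefore,

$$P\left[\frac{1}{\lambda}\max_{j \in S^c} |\tilde V_j| \ge \eta\right] \le 2(p - q)\exp\left\{-\frac{n\lambda^2\eta^2}{2\sigma^2\max_i \|x_i(S)\|_2\,\|\beta^*\|_2}\right\}.$$

So, we have

$$P\left[\frac{1}{\lambda}\max_{j \in S^c} |V_j| < 1\right] \ge 1 - P\left[\frac{1}{\lambda}\max_{j \in S^c} |\tilde V_j| \ge \eta\right] \ge 1 - 2(p - q)\exp\left\{-\frac{n\lambda^2\eta^2}{2\sigma^2\|\beta^*\|_2\max_i \|x_i(S)\|_2}\right\}.$$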

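Analysis of $\mathcal{M}(U)$:

$$\max_{i \in S} |U_i| \le \left\|\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\frac{1}{n}X(S)^T\epsilon\right\|_\infty + \lambda\left\|\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\vec{b}\right\|_\infty.$$

Define $Z_i := e_i^T\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\frac{1}{n}X(S)^T\epsilon$. Each $Z_i$ is a Gaussian random variable with
mean 0 and variance

$$\mathrm{var}(Z_i) = e_i^T\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\frac{1}{n}X(S)^T[\mathrm{VAR}(\epsilon)]\frac{1}{n}X(S)\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}e_i \le \frac{\sigma^2\|\beta^*\|_2\max_i \|x_i(S)\|_2}{nC_{\min}}.$$

So, for any $t > 0$, by (34)

$$P\left(\max_{i \in S} |Z_i| \ge t\right) \le 2q\exp\left\{-\frac{t^2 nC_{\min}}{2\sigma^2\|\beta^*\|_2\max_i \|x_i(S)\|_2}\right\},$$

and by taking $t = \lambda\eta/\sqrt{C_{\min}}$, we have

$$P\left(\max_{i \in S} |Z_i| \ge \frac{\lambda\eta}{\sqrt{C_{\min}}}\right) \le 2q\exp\left\{-\frac{n\lambda^2\eta^2}{2\sigma^2\|\beta^*\|_2\max_i \|x_i(S)\|_2}\right\}.$$

Recall the definition $\Psi(X, \beta^*, \lambda) = \lambda\left[\eta(C_{\min})^{-1/2} + \left\|\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\vec{b}\right\|_\infty\right]$;
we have

$$P\left(\max_{i \in S} |U_i| \ge \Psi(X, \beta^*, \lambda)\right) \le 2q\exp\left\{-\frac{n\lambda^2\eta^2}{2\sigma^2\|\beta^*\|_2\max_i \|x_i(S)\|_2}\right\}.$$

By the condition $M(\beta^*) > \Psi(X, \beta^*, \lambda)$, we have

$$P\left(\max_{i \in S} |U_i| < M(\beta^*)\right) \ge 1 - 2q\exp\left\{-\frac{n\lambda^2\eta^2}{2\sigma^2\|\beta^*\|_2\max_i \|x_i(S)\|_2}\right\}.$$

At last, we have

$$P[\mathcal{M}(V)\ \&\ \mathcal{M}(U)] \ge 1 - 2p\exp\left\{-\frac{n\lambda^2\eta^2}{2\sigma^2\|\beta^*\|_2\max_i \|x_i(S)\|_2}\right\}.$$

□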

1.2 Proof of Corollary 1 

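Proof. Recall the definition of $\Gamma(X, \beta^*, \sigma^2)$:

$$\Gamma(X, \beta^*, \sigma^2) = \frac{\eta^2\,\mathrm{SNR}}{8\max_i \|x_i(S)\|_2\left(\eta C_{\min}^{-1/2} + \sqrt{q}\,C_{\min}^{-1}\right)^2\log(p + 1)},$$

where $\mathrm{SNR} = \frac{n[M(\beta^*)]^2}{\sigma^2\|\beta^*\|_2}$. So,

$$\frac{n[M(\beta^*)]^2}{2\sigma^2\|\beta^*\|_2\max_i \|x_i(S)\|_2} = \frac{4\Gamma(X, \beta^*, \sigma^2)\left(\eta C_{\min}^{-1/2} + \sqrt{q}\,C_{\min}^{-1}\right)^2\log(p + 1)}{\eta^2}.$$

By taking

$$\lambda = \frac{M(\beta^*)}{2\left(\eta C_{\min}^{-1/2} + \sqrt{q}\,C_{\min}^{-1}\right)},$$

we have

$$\Psi(X, \beta^*, \lambda) = \lambda\left[\eta(C_{\min})^{-1/2} + \left\|\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\vec{b}\right\|_\infty\right] \le \lambda\left[\eta C_{\min}^{-1/2} + \sqrt{q}\,C_{\min}^{-1}\right] = \frac{M(\beta^*)}{2} < M(\beta^*),$$

and

$$\frac{n\lambda^2\eta^2}{2\sigma^2\|\beta^*\|_2\max_i \|x_i(S)\|_2} = \Gamma(X, \beta^*, \sigma^2)\log(p + 1).$$

So, the probability bound in Theorem 1 is greater than

$$1 - 2\exp\left\{-\left(\Gamma(X, \beta^*, \sigma^2) - 1\right)\log(p + 1)\right\},$$

which goes to one when $\Gamma(X, \beta^*, \sigma^2) \to \infty$. □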
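
The algebra behind the choice of $\lambda$ in this proof can be checked numerically. The sketch below is a sanity check under made-up inputs, not part of the original argument; the function names are hypothetical, and SNR is taken to be $n[M(\beta^*)]^2/(\sigma^2\|\beta^*\|_2)$ as recalled in the proof.

```python
import math


def snr(n, m_beta, sigma2, beta_norm):
    """SNR = n * M(beta*)^2 / (sigma^2 * ||beta*||_2)."""
    return n * m_beta ** 2 / (sigma2 * beta_norm)


def gamma(n, p, q, eta, c_min, m_beta, sigma2, beta_norm, max_row_norm):
    """Gamma(X, beta*, sigma^2) as recalled in the proof of Corollary 1."""
    k = eta / math.sqrt(c_min) + math.sqrt(q) / c_min
    return (eta ** 2 * snr(n, m_beta, sigma2, beta_norm)
            / (8 * max_row_norm * k ** 2 * math.log(p + 1)))


def lam_choice(q, eta, c_min, m_beta):
    """lambda = M(beta*) / (2 * (eta * C_min^(-1/2) + sqrt(q) / C_min))."""
    return m_beta / (2 * (eta / math.sqrt(c_min) + math.sqrt(q) / c_min))


def exponent(n, lam, eta, sigma2, beta_norm, max_row_norm):
    """n lambda^2 eta^2 / (2 sigma^2 ||beta*||_2 max_i ||x_i(S)||_2)."""
    return n * lam ** 2 * eta ** 2 / (2 * sigma2 * beta_norm * max_row_norm)


def theorem1_bound(p, expo):
    """Lower bound 1 - 2 p exp(-exponent) from Theorem 1."""
    return 1 - 2 * p * math.exp(-expo)


def corollary1_bound(p, gam):
    """Lower bound 1 - 2 exp(-(Gamma - 1) log(p + 1)) from Corollary 1."""
    return 1 - 2 * math.exp(-(gam - 1) * math.log(p + 1))
```

With this $\lambda$ the exponent in Theorem 1 equals $\Gamma \log(p+1)$ exactly, and the bound of Theorem 1 is never smaller than the simplified bound, since $2p \le 2(p+1)$.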

1.3 Proof of Theorem 2 

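Proof. First we prove (b). Without loss of generality, assume that for some $j \in S^c$,
$X_j^T X(S)\left(X(S)^T X(S)\right)^{-1}\vec{b} = 1 + \zeta$. Then $V_j = \lambda(1 + \zeta) + \tilde V_j$, where
$\tilde V_j = X_j^T\left[I - X(S)\left(X(S)^T X(S)\right)^{-1}X(S)^T\right]\frac{\epsilon}{n}$ is a Gaussian random variable with mean
0, so $P(\tilde V_j > 0) = \frac{1}{2}$. So, $P(V_j > \lambda) \ge \frac{1}{2}$, which implies that for any $\lambda$, Condition
(19) (a necessary condition) is violated with probability at least 1/2.

For claim (a): Condition (20),

$$\mathrm{sign}\left(\beta^*(S) + \left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\left[\frac{1}{n}X(S)^T\epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\right]\right) = \mathrm{sign}(\beta^*(S)),$$

is also a necessary condition for sign consistency. Since $\frac{1}{n}X(S)^T X(S) = I_{q \times q}$,
(20) becomes

$$\mathrm{sign}\left(\beta^*(S) + \frac{1}{n}X(S)^T\epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\right) = \mathrm{sign}(\beta^*(S)),$$

which implies that

$$\mathrm{sign}\left(\beta^*(S) + \frac{1}{n}X(S)^T\epsilon\right) = \mathrm{sign}(\beta^*(S)). \tag{24}$$

Without loss of generality, assume that for some $j \in S$, $\beta^*_j > 0$. Then (24) implies
$\beta^*_j + Z_j > 0$, where $Z_j = e_j^T\frac{1}{n}X(S)^T\epsilon$ is a Gaussian random variable with mean
0 and variance

$$\mathrm{var}(Z_j) = e_j^T\frac{1}{n}X(S)^T\,\mathrm{VAR}(\epsilon)\,\frac{1}{n}X(S)e_j = \frac{\sigma^2}{n^2}e_j^T X(S)^T\mathrm{diag}(|X\beta^*|)X(S)e_j = \frac{(\beta^*_j)^2}{c_{n,j}^2},$$

where the last equality uses the definition of $c_{n,j}$ in Theorem 2. To summarize,

$$\begin{aligned}
P[\hat\beta(\lambda) =_s \beta^*] &\le P[\beta^*_j + Z_j > 0] \\
&= P[Z_j > -\beta^*_j] \\
&= P[Z_j < \beta^*_j] \\
&= 1 - \int_{\beta^*_j}^{\infty}\frac{1}{\sqrt{2\pi\,\mathrm{var}(Z_j)}}\exp\left\{-\frac{x^2}{2\,\mathrm{var}(Z_j)}\right\}dx \\
&= 1 - \frac{1}{\sqrt{2\pi}}\int_{\beta^*_j/\sqrt{\mathrm{var}(Z_j)}}^{\infty}\exp\left\{-\frac{x^2}{2}\right\}dx \\
&\le 1 - \frac{1}{\sqrt{2\pi}}\int_{\beta^*_j/\sqrt{\mathrm{var}(Z_j)}}^{\infty}\left(\frac{x}{1 + x} + \frac{1}{(1 + x)^2}\right)\exp\left\{-\frac{x^2}{2}\right\}dx \\
&= 1 - \frac{\exp\left\{-\frac{(\beta^*_j)^2}{2\,\mathrm{var}(Z_j)}\right\}}{\sqrt{2\pi}\left(1 + \frac{\beta^*_j}{\sqrt{\mathrm{var}(Z_j)}}\right)} \\
&= 1 - \frac{\exp\left\{-\frac{c_{n,j}^2}{2}\right\}}{\sqrt{2\pi}\,(1 + c_{n,j})}.
\end{aligned}$$

The second inequality holds because $\frac{x}{1 + x} + \frac{1}{(1 + x)^2} \le 1$ for $x \ge 0$, and the integrand
on that line is the derivative of $-\exp\{-x^2/2\}/(1 + x)$.

□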
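
The Gaussian tail inequality used in the last steps of claim (a), $P(Z > a) \ge \exp\{-a^2/2\}/(\sqrt{2\pi}(1 + a))$ for $a \ge 0$ with $Z$ standard normal, can be checked numerically. This is a minimal sketch with hypothetical function names, using only the Python standard library.

```python
import math


def gaussian_upper_tail(a):
    """P(Z > a) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))


def tail_lower_bound(a):
    """Lower bound exp(-a^2 / 2) / (sqrt(2 pi) * (1 + a)), intended for a >= 0."""
    return math.exp(-a * a / 2.0) / (math.sqrt(2.0 * math.pi) * (1.0 + a))


def success_upper_bound(c):
    """Upper bound 1 - exp(-c^2 / 2) / (sqrt(2 pi) * (1 + c)) on sign recovery."""
    return 1.0 - tail_lower_bound(c)


if __name__ == "__main__":
    for a in (0.0, 0.5, 1.0, 2.0, 3.0, 5.0):
        print(a, gaussian_upper_tail(a), tail_lower_bound(a))
```

The upper bound on the probability of sign recovery approaches one only as the standardized signal grows, which is the sense in which a large signal to noise ratio is necessary.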

1.4 Proofs of Theorem 3 

To prove Theorem 3, we need some preliminary results. 

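Lemma 2. Conditioned on $X(S)$ and $\epsilon$, the random vector $V$ is Gaussian. Its
mean vector is upper bounded as

$$\left|E[V \mid \epsilon, X(S)]\right| \le \lambda(1 - \eta)\mathbf{1}. \tag{25}$$

Moreover, its conditional covariance takes the form

$$\mathrm{cov}[V \mid \epsilon, X(S)] = M_n\Sigma_{2|1} = M_n\left[\Sigma_{22} - \Sigma_{21}(\Sigma_{11})^{-1}\Sigma_{12}\right], \tag{26}$$

where

$$M_n = \lambda^2\vec{b}^T\left(X(S)^T X(S)\right)^{-1}\vec{b} + \frac{1}{n^2}\epsilon^T\left[I - X(S)\left(X(S)^T X(S)\right)^{-1}X(S)^T\right]\epsilon. \tag{27}$$

Lemma 3. Let $M_1 = \lambda^2\vec{b}^T\left(X(S)^T X(S)\right)^{-1}\vec{b}$ and $M_2 = \frac{1}{n^2}\epsilon^T\left[I - X(S)\left(X(S)^T X(S)\right)^{-1}X(S)^T\right]\epsilon$,
then $M_n = M_1 + M_2$. We have

$$P\left[\frac{\lambda^2 q}{2nC_{\max}} \le M_1 \le \frac{2\lambda^2 q}{nC_{\min}}\right] \ge 1 - \exp\{-0.03n\}, \tag{28}$$

$$P\left[M_2 \ge \frac{3\sigma^2\sqrt{C_{\max}}\,\|\beta^*\|_2}{n}\right] \le \frac{1}{n}. \tag{29}$$

Lemma 4.

$$P\left[\max_{i = 1, \ldots, n}\|x_i(S)\|_2^2 \ge 2C_{\max}\max(16q, 4\log n)\right] \le \frac{1}{n}. \tag{30}$$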

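Proofs of these lemmas can be found in Appendix 1.7. Now, we prove
Theorem 3.

Analysis of $\mathcal{M}(V)$: Define the event $T = \{M_n > v^*\}$, where

$$v^* = \frac{2\lambda^2 q}{nC_{\min}} + \frac{3\sigma^2\sqrt{C_{\max}}\,\|\beta^*\|_2}{n}.$$

By Lemma 3, we have $P[T] \le \exp\{-0.03n\} + \frac{1}{n}$.

Let $\mu_j = E[V_j \mid \epsilon, X(S)]$, $Z_j = V_j - \mu_j$, and $Z = (Z_j)_{j \in S^c}$, then $E[Z \mid X(S), \epsilon] = 0$
and $\mathrm{cov}(Z \mid X(S), \epsilon) = \mathrm{cov}(V \mid X(S), \epsilon) = M_n\Sigma_{2|1}$.

$$\max_{j \in S^c}|V_j| = \max_{j \in S^c}|\mu_j + Z_j| \le \max_{j \in S^c}\left[|\mu_j| + |Z_j|\right] \le (1 - \eta)\lambda + \max_{j \in S^c}|Z_j|.$$

From this inequality, we have

$$\left\{\max_{j \in S^c}|Z_j| < \eta\lambda\right\} \subset \left\{\max_{j \in S^c}|V_j| < \lambda\right\}.$$

Define $\tilde Z$ to be a zero-mean Gaussian with covariance $v^*\Sigma_{2|1}$. Since

$$P\left[\max_{j \in S^c}|Z_j| \ge \eta\lambda \,\Big|\, T^c\right] \le \sum_{j \in S^c}P\left[|Z_j| \ge \eta\lambda \mid T^c\right] \le (p - q)\max_{j \in S^c}P\left[|\tilde Z_j| \ge \eta\lambda\right] \le 2(p - q)\exp\left\{-\frac{\eta^2\lambda^2}{2v^*C_{\max}}\right\},$$

we have

$$P\left[\max_{j \in S^c}|V_j| \ge \lambda\right] \le P\left[\max_{j \in S^c}|Z_j| \ge \eta\lambda \,\Big|\, T^c\right] + P[T] \le 2(p - q)\exp\left\{-\frac{\eta^2\lambda^2}{2v^*C_{\max}}\right\} + \exp\{-0.03n\} + \frac{1}{n}.$$

This says that

$$P[\mathcal{M}(V)] \ge 1 - 2(p - q)\exp\left\{-\frac{\eta^2\lambda^2}{2v^*C_{\max}}\right\} - \exp\{-0.03n\} - \frac{1}{n}.$$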

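Analysis of $\mathcal{M}(U)$: Now we analyze $\max_{i \in S}|U_i|$.

$$\max_{i \in S}|U_i| \le \left\|\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\frac{1}{n}X(S)^T\epsilon\right\|_\infty + \lambda\left\|\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\vec{b}\right\|_\infty.$$

Define $\Lambda_i(\cdot)$ to be the ith largest eigenvalue of a matrix. Since

$$\left\|\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\vec{b}\right\|_\infty \le \frac{\sqrt{q}}{\Lambda_{\min}\left(\frac{1}{n}X(S)^T X(S)\right)},$$

by Equation (37) in Corollary 4, we have

$$P\left[\lambda\left\|\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\vec{b}\right\|_\infty \le \frac{2\lambda\sqrt{q}}{C_{\min}}\right] \ge 1 - 2\exp(-0.03n).$$

Let

$$W_i = e_i^T\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\frac{1}{n}X(S)^T\epsilon,$$

then conditioned on $X(S)$, $W_i$ is a Gaussian random variable with mean 0 and
variance

$$\mathrm{var}(W_i \mid X(S)) = e_i^T\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}\frac{1}{n}X(S)^T[\mathrm{VAR}(\epsilon)]\frac{1}{n}X(S)\left(\frac{1}{n}X(S)^T X(S)\right)^{-1}e_i \le \frac{\sigma^2\|\beta^*\|_2\max_i\|x_i(S)\|_2}{n\Lambda_{\min}\left(\frac{1}{n}X(S)^T X(S)\right)}.$$

Using (37),

$$P\left[\Lambda_i\left(\frac{1}{n}X^T X\right) \ge \frac{1}{2}C_{\min}\right] \ge 1 - 2\exp(-0.03n),$$

and Lemma 4, we have

$$\frac{\sigma^2\|\beta^*\|_2\max_i\|x_i(S)\|_2}{n\Lambda_{\min}\left(\frac{1}{n}X(S)^T X(S)\right)} \le \frac{2\sigma^2\|\beta^*\|_2\sqrt{2C_{\max}\max(16q, 4\log n)}}{nC_{\min}}$$

with probability no less than $1 - 2\exp\{-0.03n\} - \frac{1}{n}$.
Define the event

$$T = \left\{\frac{\sigma^2\|\beta^*\|_2\max_i\|x_i(S)\|_2}{n\Lambda_{\min}\left(\frac{1}{n}X(S)^T X(S)\right)} \le \frac{2\sigma^2\|\beta^*\|_2\sqrt{2C_{\max}\max(16q, 4\log n)}}{nC_{\min}}\right\},$$

then $P(T) \ge 1 - 2\exp\{-0.03n\} - \frac{1}{n}$. From the proof of Lemma 5, for any $t > 0$,

$$P(|W_i| \ge t \mid X(S), T) \le 2\exp\left(-\frac{t^2}{2\,\mathrm{var}(W_i \mid X(S), T)}\right).$$

The above is also true if we replace $\mathrm{var}(W_i \mid X(S), T)$ with any upper bound.
So, we have

$$P(|W_i| \ge t \mid X(S), T) \le 2\exp\left\{-\frac{t^2 nC_{\min}}{4\sigma^2\|\beta^*\|_2\sqrt{2C_{\max}\max(16q, 4\log n)}}\right\}.$$

So,

$$P(|W_i| \ge t) \le P(|W_i| \ge t \mid T) + P(T^c) \le 2\exp\left\{-\frac{t^2 nC_{\min}}{4\sigma^2\|\beta^*\|_2\sqrt{2C_{\max}\max(16q, 4\log n)}}\right\} + 2\exp\{-0.03n\} + \frac{1}{n}.$$

By taking

$$t = A(n, \beta^*, \sigma^2) := \sqrt{\frac{4\sigma^2\|\beta^*\|_2\log n\sqrt{2C_{\max}\max(16q, 4\log n)}}{nC_{\min}}},$$

we have

$$P\left(\max_{i \in S}|W_i| \ge A(n, \beta^*, \sigma^2)\right) \le \frac{2q}{n} + 2q\exp\{-0.03n\} + \frac{q}{n} = \frac{3q}{n} + 2q\exp\{-0.03n\}.$$

To summarize,

$$P\left[\max_{i \in S}|U_i| \ge A(n, \beta^*, \sigma^2) + \frac{2\lambda\sqrt{q}}{C_{\min}}\right] \le \frac{3q}{n} + 2q\exp\{-0.03n\} + 2\exp\{-0.03n\}.$$

At last, we have

$$P[\mathcal{M}(V)\ \&\ \mathcal{M}(U)] \ge 1 - 2(p - q)\exp\left\{-\frac{\eta^2\lambda^2}{2v^*C_{\max}}\right\} - (2q + 3)\exp\{-0.03n\} - \frac{1 + 3q}{n}.$$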



1.5 Proofs of Corollary 3 

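Proof. By taking $\lambda = \frac{[M(\beta^*) - A(n, \beta^*, \sigma^2)]\,C_{\min}}{4\sqrt{q}}$, we have

$$\Psi(n, \beta^*, \lambda, \sigma^2) = A(n, \beta^*, \sigma^2) + \frac{2\lambda\sqrt{q}}{C_{\min}} = \frac{M(\beta^*) + A(n, \beta^*, \sigma^2)}{2} < M(\beta^*),$$

where the last inequality uses the assumption that $M(\beta^*) > A(n, \beta^*, \sigma^2)$. Moreover,

$$\frac{\lambda^2}{v^*(n, \beta^*, \lambda, \sigma^2)} = \frac{\lambda^2}{\frac{2\lambda^2 q}{nC_{\min}} + \frac{3\sigma^2\sqrt{C_{\max}}\,\|\beta^*\|_2}{n}} = \frac{1}{\frac{2q}{nC_{\min}} + \frac{3\sigma^2\sqrt{C_{\max}}\,\|\beta^*\|_2}{n\lambda^2}} = \frac{1}{\frac{2q}{nC_{\min}} + \frac{48\sigma^2 q\sqrt{C_{\max}}\,\|\beta^*\|_2}{nC_{\min}^2[M(\beta^*) - A(n, \beta^*, \sigma^2)]^2}}.$$

By the definition of $\Gamma(n, \beta^*, \sigma^2)$, we have that

$$\frac{\eta^2\lambda^2}{2v^*(n, \beta^*, \lambda, \sigma^2)\,C_{\max}} = \log(p - q + 1)\,\Gamma(n, \beta^*, \sigma^2),$$

so the probability bound in Theorem 3 now becomes

$$\begin{aligned}
&1 - 2\exp\left\{-\frac{\eta^2\lambda^2}{2v^*C_{\max}} + \log(p - q)\right\} - (2q + 3)\exp\{-cn\} - \frac{1 + 3q}{n} \\
&= 1 - 2\exp\left\{-\log(p - q + 1)\,\Gamma(n, \beta^*, \sigma^2) + \log(p - q)\right\} - (2q + 3)\exp\{-cn\} - \frac{1 + 3q}{n} \\
&\ge 1 - 2\exp\left\{-\log(p - q + 1)\left[\Gamma(n, \beta^*, \sigma^2) - 1\right]\right\} - (2q + 3)\exp\{-cn\} - \frac{1 + 3q}{n}.
\end{aligned}$$

If Condition (14) holds, then $\Gamma(n, \beta^*, \sigma^2) \to \infty$, which guarantees $P[\hat\beta(\lambda) =_s \beta^*] \to 1$. □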

1.6 Proof of Theorem 4 

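Proof. Without loss of generality, assume

$$e_j^T\Sigma_{21}(\Sigma_{11})^{-1}\mathrm{sign}(\beta^*(S)) = 1 + \zeta,$$

for some $j \in S^c$ and $\zeta > 0$. Since $E[V \mid X(S), \epsilon] = \lambda\Sigma_{21}(\Sigma_{11})^{-1}\mathrm{sign}(\beta^*(S))$,
conditioned on $X(S)$ and $\epsilon$, $V_j$ is a Gaussian random variable with mean $\lambda(1 + \zeta)$.
So $P[V_j > \lambda(1 + \zeta) \mid X(S), \epsilon] = \frac{1}{2}$, which implies $P[V_j > \lambda \mid X(S), \epsilon] \ge \frac{1}{2}$. Then we
have $P(V_j > \lambda) \ge \frac{1}{2}$. So for any $\lambda$,

$$P[\hat\beta(\lambda) =_s \beta^*] \le P\left[\max_{j \in S^c}V_j \le \lambda\right] \le \frac{1}{2}.$$

□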

1.7 Proofs of Lemma 2 — Lemma 4 



Proof of Lemma 2 

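Proof. Conditioned on $X(S)$ and $\epsilon$, the only random component in $V_j$ is the
column vector $X_j$, $j \in S^c$. We know that $(X(S^c) \mid X(S), \epsilon) \sim
(X(S^c) \mid X(S))$ is Gaussian with mean and covariance

$$E[X(S^c)^T \mid X(S), \epsilon] = \Sigma_{21}(\Sigma_{11})^{-1}X(S)^T, \tag{31}$$

$$\mathrm{var}(X(S^c) \mid X(S)) = \Sigma_{2|1} = \Sigma_{22} - \Sigma_{21}(\Sigma_{11})^{-1}\Sigma_{12}. \tag{32}$$

Consequently, we have

$$\begin{aligned}
\left|E[V \mid X(S), \epsilon]\right| &= \left|\Sigma_{21}(\Sigma_{11})^{-1}X(S)^T\left\{X(S)\left(X(S)^T X(S)\right)^{-1}\lambda\vec{b} - \left[X(S)\left(X(S)^T X(S)\right)^{-1}X(S)^T - I\right]\frac{\epsilon}{n}\right\}\right| \\
&= \left|\lambda\Sigma_{21}(\Sigma_{11})^{-1}\vec{b}\right| \\
&\le \lambda(1 - \eta)\mathbf{1},
\end{aligned}$$

where the last inequality uses Condition (13).

Now, we compute the elements of the conditional covariance

$$\mathrm{cov}(V_j, V_k \mid \epsilon, X(S)).$$

Let $a = X(S)\left(X(S)^T X(S)\right)^{-1}\lambda\vec{b} - \left[X(S)\left(X(S)^T X(S)\right)^{-1}X(S)^T - I\right]\frac{\epsilon}{n}$,
then $V_j = X_j^T a$. So we have

$$\mathrm{cov}(V_j, V_k \mid \epsilon, X(S)) = a^T\mathrm{cov}(X_j, X_k \mid \epsilon, X(S))\,a = [\mathrm{var}(X(S^c) \mid X(S))]_{jk}\,a^T a.$$

Consequently,

$$\mathrm{cov}(V \mid \epsilon, X(S)) = a^T a\,\mathrm{var}(X(S^c) \mid X(S)) = a^T a\,\Sigma_{2|1} = a^T a\left[\Sigma_{22} - \Sigma_{21}(\Sigma_{11})^{-1}\Sigma_{12}\right].$$

By direct calculation (the cross term vanishes because $X(S)^T\left[I - X(S)(X(S)^T X(S))^{-1}X(S)^T\right] = 0$), we have $a^T a = M_n$. □

Proof of Lemma 3

Proof. Recall that $M_1 = \lambda^2\vec{b}^T\left(X(S)^T X(S)\right)^{-1}\vec{b}$. So,

$$\frac{\lambda^2 q}{\Lambda_{\max}\left(X(S)^T X(S)\right)} \le M_1 \le \frac{\lambda^2 q}{\Lambda_{\min}\left(X(S)^T X(S)\right)}.$$

From (37) we have

$$P\left[\frac{\lambda^2 q}{2nC_{\max}} \le M_1 \le \frac{2\lambda^2 q}{nC_{\min}}\right] \ge 1 - 2\exp(-0.03n).$$

Define $\varrho = E[|Z|]$, where $Z \sim N(0, 1)$, then for any random variable $R \sim
N(0, \sigma^2)$, $E[|R|] = \sigma\varrho$. Since $x_i^T\beta^* \sim N(0, \beta^*(S)^T\Sigma_{11}\beta^*(S))$, we have

$$E[|x_i^T\beta^*|] = \sqrt{\beta^*(S)^T\Sigma_{11}\beta^*(S)}\;\varrho.$$

We know that $M_2 \le \frac{1}{n^2}\epsilon^T\epsilon$. Since $E[\epsilon_i^2] = E[E[\epsilon_i^2 \mid X(S)]] = E[\sigma^2|x_i^T\beta^*|] =
\sigma^2\sqrt{\beta^*(S)^T\Sigma_{11}\beta^*(S)}\;\varrho$, and $E[\epsilon_i^4] = E[E[\epsilon_i^4 \mid X(S)]] = 3E[\sigma^4|x_i^T\beta^*|^2] = 3\sigma^4\beta^*(S)^T\Sigma_{11}\beta^*(S)$,
we have, by Chebyshev's inequality,

$$\begin{aligned}
P\left[\sum_{i=1}^n\epsilon_i^2 - nE[\epsilon_i^2] \ge n\sigma^2\sqrt{(3 - \varrho^2)\,\beta^*(S)^T\Sigma_{11}\beta^*(S)}\right]
&\le \frac{n\,\mathrm{var}(\epsilon_i^2)}{n^2\sigma^4(3 - \varrho^2)\,\beta^*(S)^T\Sigma_{11}\beta^*(S)} \\
&= \frac{3\sigma^4\beta^*(S)^T\Sigma_{11}\beta^*(S) - \sigma^4\beta^*(S)^T\Sigma_{11}\beta^*(S)\,\varrho^2}{n\sigma^4(3 - \varrho^2)\,\beta^*(S)^T\Sigma_{11}\beta^*(S)} \\
&= \frac{1}{n}.
\end{aligned}$$

So,

$$P\left[M_2 \ge \frac{\sigma^2\left(\varrho + \sqrt{3 - \varrho^2}\right)\sqrt{\beta^*(S)^T\Sigma_{11}\beta^*(S)}}{n}\right] \le \frac{1}{n}.$$

Since $\sqrt{\beta^*(S)^T\Sigma_{11}\beta^*(S)} \le \sqrt{C_{\max}}\,\|\beta^*\|_2$ and $\varrho = E(|Z|) \le \sqrt{E(|Z|^2)} = 1$, where $Z$
is a standard normal random variable, we have

$$\frac{\sigma^2\left(\varrho + \sqrt{3 - \varrho^2}\right)\sqrt{\beta^*(S)^T\Sigma_{11}\beta^*(S)}}{n} \le \frac{3\sigma^2\sqrt{C_{\max}}\,\|\beta^*\|_2}{n}.$$

Then we have

$$P\left[M_2 \ge \frac{3\sigma^2\sqrt{C_{\max}}\,\|\beta^*\|_2}{n}\right] \le \frac{1}{n}.$$

□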



Proof of Lemma 4 



Proof. By Lemma 6, we have for any $t > q$,

$$P\left[\max_{i = 1, \ldots, n}\left\|\Sigma_{11}^{-1/2}x_i(S)\right\|_2^2 \ge 2t\right] \le n\exp\left(-t\left[1 - 2\sqrt{\frac{q}{t}}\right]\right).$$



34 JINZHU JIA, KARL ROHE AND BIN YU 

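Taking $t = \max(16q, 4\log n)$, we have

$$\exp\left(-t\left[1 - 2\sqrt{\frac{q}{t}}\right]\right) \le \exp\left(-t\left[1 - 2\sqrt{\frac{1}{16}}\right]\right) = \exp\left(-\frac{t}{2}\right) \le \frac{1}{n^2},$$

so,

$$P\left[\max_{i = 1, \ldots, n}\left\|\Sigma_{11}^{-1/2}x_i(S)\right\|_2^2 \ge 2\max(16q, 4\log n)\right] \le \frac{1}{n}.$$

Since $\left\|\Sigma_{11}^{-1/2}x_i(S)\right\|_2^2 \ge \frac{\|x_i(S)\|_2^2}{C_{\max}}$, we have

$$P\left[\max_{i = 1, \ldots, n}\|x_i(S)\|_2^2 \ge 2C_{\max}\max(16q, 4\log n)\right] \le \frac{1}{n}. \tag{33}$$

□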



2 Some Gaussian Comparison Results 



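Lemma 5. For any mean zero Gaussian random vector $(X_1, \ldots, X_n)$, and $t > 0$,

$$P\left(\max_{1 \le i \le n}|X_i| \ge t\right) \le 2n\exp\left\{-\frac{t^2}{2\max_i E(X_i^2)}\right\}. \tag{34}$$

Proof. Note that the moment generating function of $X_i$ is

$$E\left(e^{tX_i}\right) = \exp\left\{\frac{E(X_i^2)\,t^2}{2}\right\}.$$

So, for any $t > 0$,

$$P(X_i \ge x) = P\left(e^{tX_i} \ge e^{tx}\right) \le \frac{E\left(e^{tX_i}\right)}{e^{tx}} = \exp\left\{\frac{E(X_i^2)\,t^2}{2} - tx\right\},$$

and by taking $t = \frac{x}{E(X_i^2)}$, we have

$$P(X_i \ge x) \le \exp\left\{-\frac{x^2}{2E(X_i^2)}\right\}.$$

So,

$$P(|X_i| \ge t) = 2P(X_i \ge t) \le 2\exp\left\{-\frac{t^2}{2E(X_i^2)}\right\} \le 2\exp\left\{-\frac{t^2}{2\max_i E(X_i^2)}\right\}.$$

So, by the union bound,

$$P\left(\max_{1 \le i \le n}|X_i| \ge t\right) \le 2n\exp\left\{-\frac{t^2}{2\max_i E(X_i^2)}\right\}.$$

□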
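
The bound (34) is easy to compare with simulation. The sketch below is a hedged illustration with hypothetical function names: it draws independent Gaussian coordinates, which is an assumption made for convenience (the lemma itself allows any dependence), and estimates the left-hand side of (34) by Monte Carlo.

```python
import math
import random


def max_tail_bound(t, variances):
    """Right-hand side of (34): 2 n exp(-t^2 / (2 max_i E X_i^2))."""
    n = len(variances)
    return 2.0 * n * math.exp(-t * t / (2.0 * max(variances)))


def empirical_max_tail(t, variances, trials, rng):
    """Monte Carlo estimate of P(max_i |X_i| >= t) for independent X_i."""
    hits = 0
    for _ in range(trials):
        draws = (rng.gauss(0.0, math.sqrt(v)) for v in variances)
        if max(abs(x) for x in draws) >= t:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    variances = [1.0, 0.5, 2.0]
    print(empirical_max_tail(4.0, variances, 20000, rng))
    print(max_tail_bound(4.0, variances))
```

The bound is loose for small t, where its value can exceed one, and becomes informative only once $t^2$ is large relative to the largest variance.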
3 Large deviation for $\chi^2$ distribution

Lemma 6. Let $Z_1, \ldots, Z_n$ be i.i.d. $\chi^2$-variates with q degrees of freedom. Then
for all $t > q$, we have

$$P\left[\max_{i = 1, \ldots, n}Z_i \ge 2t\right] \le n\exp\left(-t\left[1 - 2\sqrt{\frac{q}{t}}\right]\right). \tag{35}$$

The proof of this lemma can be found in Obozinski et al. (2008).
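
The bound (35), and the choice $t = \max(16q, 4\log n)$ made in the proof of Lemma 4, can be explored numerically. The following is a minimal sketch with hypothetical function names; chi-square variates are simulated as sums of squared standard normal draws.

```python
import math
import random


def chi2_max_bound(n, q, t):
    """Right-hand side of (35): n exp(-t (1 - 2 sqrt(q / t))), for t > q."""
    return n * math.exp(-t * (1.0 - 2.0 * math.sqrt(q / t)))


def lemma4_threshold(n, q):
    """The choice t = max(16 q, 4 log n) used in the proof of Lemma 4."""
    return max(16.0 * q, 4.0 * math.log(n))


def empirical_chi2_max_tail(n, q, t, trials, rng):
    """Monte Carlo estimate of P(max_i Z_i >= 2 t), Z_i i.i.d. chi-square(q)."""
    hits = 0
    for _ in range(trials):
        largest = max(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(q))
                      for _ in range(n))
        if largest >= 2.0 * t:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    print(chi2_max_bound(5, 1, 9.0))
    print(empirical_chi2_max_tail(5, 1, 9.0, 2000, rng))
    print(chi2_max_bound(50, 3, lemma4_threshold(50, 3)), 1.0 / 50)
```

With the threshold from Lemma 4, $\sqrt{q/t} \le 1/4$ and $t/2 \ge 2\log n$, so the right-hand side of (35) is at most $1/n$.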



4 Some useful random matrix results 

In this appendix, we use some known concentration inequalities for the extreme
eigenvalues of Gaussian random matrices (Davidson and Szarek, 2001) to bound
the eigenvalues of a Gaussian random matrix. Although these results hold more
generally, our interest here is on scalings (n, q) such that $q/n \to 0$.

Lemma 7 (Davidson and Szarek (2001)). Let $\Gamma \in \mathbb{R}^{n \times q}$ be a random matrix
whose entries are i.i.d. from $N(0, 1/n)$, $q \le n$. Let the singular values of $\Gamma$ be
$s_1(\Gamma) \ge \ldots \ge s_q(\Gamma)$. Then

$$\max\left\{P\left[s_1(\Gamma) \ge 1 + \sqrt{\frac{q}{n}} + t\right],\; P\left[s_q(\Gamma) \le 1 - \sqrt{\frac{q}{n}} - t\right]\right\} < \exp\{-nt^2/2\}.$$

Using Lemma 7, we now have some useful results.

Lemma 8. Let $U \in \mathbb{R}^{n \times q}$ be a random matrix with elements from the standard
normal distribution (i.e., $U_{ij} \sim N(0, 1)$, i.i.d.). Assume that $q/n \to 0$. Let the
eigenvalues of $\frac{1}{n}U^T U$ be $\Lambda_1(\frac{1}{n}U^T U) \ge \ldots \ge \Lambda_q(\frac{1}{n}U^T U)$. Then when n is big
enough,

$$P\left[\frac{1}{2} \le \Lambda_i\left(\frac{1}{n}U^T U\right) \le 2\right] \ge 1 - 2\exp(-0.03n). \tag{36}$$

Proof. Let $\Gamma = \frac{1}{\sqrt{n}}U$, then $\Lambda_i(\frac{1}{n}U^T U) = s_i^2(\Gamma)$. By Lemma 7,

$$P\left[s_q(\Gamma) \le 1 - \sqrt{\frac{q}{n}} - t\right] < \exp\{-nt^2/2\},$$

and by taking $t = t_0 = 1 - \frac{1}{\sqrt{2}} - 0.1$, we have

$$P\left[s_q(\Gamma) \le \frac{1}{\sqrt{2}} + 0.1 - \sqrt{\frac{q}{n}}\right] < \exp\{-nt_0^2/2\}.$$

Since $q/n \to 0$ by assumption, we have when n is big enough, $\sqrt{q/n} < 0.1$, then

$$P\left[s_q(\Gamma) \le \frac{1}{\sqrt{2}}\right] < \exp\{-nt_0^2/2\},$$

which implies that, for any $i = 1, \ldots, q$,

$$P\left[\Lambda_i\left(\frac{1}{n}U^T U\right) \le \frac{1}{2}\right] < \exp\{-nt_0^2/2\}.$$

Following the same procedure,

$$P\left[\Lambda_i\left(\frac{1}{n}U^T U\right) \ge 2\right] < \exp\{-nt_1^2/2\},$$

for $t_1 = \sqrt{2} - 1.1$. Then inequality (36) holds immediately.

□



Corollary 4. Let $X \in \mathbb{R}^{n \times q}$ be a random matrix whose rows are i.i.d.
from the normal distribution with mean 0 and covariance $\Sigma$. Assume that $0 <
C_{\min} \le \Lambda_i(\Sigma) \le C_{\max} < \infty$ and $q/n \to 0$, then when n is big enough,

$$P\left[\frac{1}{2}C_{\min} \le \Lambda_i\left(\frac{1}{n}X^T X\right) \le 2C_{\max}\right] \ge 1 - 2\exp(-0.03n). \tag{37}$$

Proof. Let $U = X\Sigma^{-1/2}$; then U satisfies the condition in Lemma 8. Then

$$P\left[\frac{1}{2} \le \Lambda_i\left(\frac{1}{n}U^T U\right) \le 2\right] \ge 1 - 2\exp(-0.03n).$$

Since

$$C_{\min}\Lambda_q\left(\frac{1}{n}U^T U\right) \le \Lambda_i\left(\frac{1}{n}X^T X\right) \le C_{\max}\Lambda_1\left(\frac{1}{n}U^T U\right),$$

result (37) is obtained immediately.

□